This is a methodological account of a preliminary data exploration and analysis carried out on a corpus of 500 novels. 250 of these texts are generally categorised as belonging to a genre called ‘realism’ and will be used in this context as a benchmark against which modernist literary style may be defined. The first text in the realist corpus, chronologically speaking, is Jane Austen’s novel ‘Lady Susan’, written in 1794. The final one is Thomas Hardy’s novel ‘Jude the Obscure’, published in 1895. This corpus contains the complete prose works (a phrase here encompassing novels, novellas and short story collections) of fifteen writers: Jane Austen, Emily, Anne and Charlotte Brontë, Stephen Crane, Honoré de Balzac, Charles Dickens, Fyodor Dostoevsky, George Eliot, Gustave Flaubert, Elizabeth Gaskell, Thomas Hardy, William Makepeace Thackeray, Leo Tolstoy and Émile Zola.

The corpus of 250 modernist novels begins in the year 1869, with Henry James’ first bloc of short stories, and continues all the way to Samuel Beckett’s 1988 novella ‘Stirrings Still’, so there is some overlap between the two corpora’s start and end points. This modernist corpus consists of the complete works of nineteen writers: Djuna Barnes, Samuel Beckett, Jorge Luis Borges, Elizabeth Bowen, Joseph Conrad, William Faulkner, F. Scott Fitzgerald, Ford Madox Ford, Ernest Hemingway, Henry James, James Joyce, Franz Kafka, D.H. Lawrence, Katherine Mansfield, Flann O’Brien, Marcel Proust, Gertrude Stein, Edith Wharton and Virginia Woolf.

This disproportion between the two corpora, with fifteen realists versus nineteen modernists, may seem disconcerting at first, but what the statistical analyses require is that the number of observations be equal, rather than the number of novelists. Unfortunately, realist authors wrote more novels than modernist authors, and this compromised our ability to retain the same number of authors on each end of the generic spectrum. The decision was therefore reached to expand the number of modernist texts where possible. Rather than treating each short story as a text in isolation, or treating all of an author’s stories as a single monolithic corpus, such as ‘The Complete Short Stories of Ford Madox Ford’, an attempt was made to preserve the initial form in which the texts were published. Fitzgerald, to give one example, published a number of short story collections, such as ‘Tales of the Jazz Age’ (1922) and the ‘Pat Hobby Stories’ (1941), but he also published many other texts in magazines throughout his career. These magazine stories were grouped into blocs by period of publication: 1909 to 1917, 1920 to 1925, 1926 to 1934, 1935 to 1940, and stories published posthumously from 1940 onwards. This was likewise the case for James; his short stories were periodised into blocs of stories published from 1864 to 1869, 1870 to 1879, 1880 to 1889, 1890 to 1899, 1900 to 1909, and finally stories he published from 1910 onwards. One exception was Joseph Conrad’s novel ‘Heart of Darkness’, which was first published in three parts in Blackwood’s Magazine in 1899, and thereafter in 1902, as part of a publication entitled ‘Youth: A Narrative, and Two Other Stories’. This book also included a short autobiographical story which gives the collection its name, and Conrad’s novel ‘The End of the Tether’.
It was decided that, since ‘Heart of Darkness’ is usually studied as a novel in its own right and as a significant modernist literary milestone, quantifying its grammar apart from these other texts was justifiable, particularly given the need to reach the target of 250 separate modernist texts. Dividing the short fiction into periods in this way also allows us to do greater analytical justice to how a writer’s prose style may change over the course of their writing career.

It was also necessary to make adjustments to reduce the number of realist texts. This was primarily due to Balzac, who authored over one hundred prose works. Fortunately, Balzac’s writings were published in thematic cycles, such as ‘Scenes from Parisian Life’, ‘Scenes from Political Life’ and ‘Catherine de’ Medici’. The length and category of the texts in each of these groups vary widely: some are novels, some are novellas and many are short stories. The decision was made to collapse the shorter texts according to these broader groupings, with the result that Balzac’s short stories were the category most affected by this reduction in the number of realist texts. This was done in lieu of removing a large number of Balzac’s works from the corpus altogether, as that would have required making assumptions about which of his writings are more representative or canonical and therefore more worthy of inclusion in the corpus.

One final aspect to consider is the international dimension. The realist corpus includes ten novelists who wrote in English, but there are also two Russian and three French realists, two of whom, Zola and the aforementioned Balzac, were far more prolific than any other writer in either corpus. Zola and Balzac composed 86 and 34 novels, short story collections or novellas respectively. This has the consequence that well over half of the realist corpus is in translation from another language, in comparison to just under 10% of the modernist corpus. I intend to address this at a later stage in my research. There has been some work published in digital humanities journals on the issues surrounding the quantification of literature in translation and across languages, but I do not yet possess a sufficient breadth of knowledge in this field to comment intelligently on the matter. I do think it is important to have French and Russian writers included in the realist corpus on the basis that many of them, be they Tolstoy, Flaubert or Balzac, exerted a significant influence on their modernist successors, including James, Woolf and Joyce, amongst others.

With regard to the issue of translation, as with so much else in quantitative analysis, we must turn to the question of what data is available for us to capture. It was very straightforward to find the complete works of these novelists in a relatively clean format; the problem is that I could only find one translation of each which was out of copyright. Whether or not these are ‘the best’ or most accurate translations is somewhat beside the point. From the reading I have done around the issue of literary translation, it is clear that translations change over time; this is in the nature of how text is received and re-constituted in different eras for different communities of readers. The most germane point here is that the translations being analysed in this instance could not be considered the most contemporary. There might be an argument for retaining these older translations on the basis that they are more likely to be the versions of the text which would have been circulating in the early twentieth century, and therefore the translations modernist authors would have been more likely to have read, but making this claim would carry a greater burden of proof, such as establishing what languages each author read novels in and what their reading habits were more generally. The point is that there are translations in this corpus which would not be regarded as the standard English versions of the text today, such as C.K. Scott Moncrieff’s ‘Remembrance of Things Past’, his translation of Marcel Proust’s ‘À la recherche du temps perdu’, published between 1922 and 1931, which has since been superseded, first by D.J. Enright’s in 1992, and then by Lydia Davis and others in 2002.

So, to turn to the analysis itself. My research is directed towards the quantitative analysis of grammar, the rationale being that we could, by examining the varying quantities of particular categories of words, such as verbs, adjectives or prepositions, develop an understanding of how literary fiction changes from the beginning of the nineteenth century until the end of the twentieth, and, more specifically, how literary modernism departs from, or, perhaps even remains contiguous with, this previous generation of novel writing. So determining the stylistic relationship between these two genres based on grammar is what I am interested in doing.

Though it might seem as though we command a fairly broad overview of two centuries of literary history, considering we are analysing 500 novels from 1794 to 1988, it should be noted that this is by no means a prospect for a definitive answer on the matter of modernist literary style. There are only five hundred observations for each variable, which works out at just under 2.6 books per year, a tiny percentage of what was actually published in this time period.

These texts were obtained in .mobi and .azw formats and converted into .txt files using an ebook management software called Calibre. Front and back matter were removed, such as introductions, forewords, afterwords, followed by titles, chapter names and section divisions. The corpora were then split up into separate novels, and read into the Python workspace. The data was inspected thoroughly at each stage in the process to ensure that it was not becoming corrupted or being made more dirty rather than less.

Now, the programming language Python has a natural language processing library called the Natural Language Toolkit, or NLTK, and this is the library that we obtain our Part of Speech or POS tagger from. Once each .txt file was read into the workspace and tokenized, the POS tagger looped through each one and assigned each word to one of thirty-five different grammatical categories. These were then pared down to twenty-nine because some of them, such as ‘foreign words’, list objects and exclamations, were too specific to register in the text on any scale which could be considered significant. As such, we quantified coordinating conjunctions, determiners, existentials, prepositions, adjectives, modals, nouns, predeterminers, pronouns, adverbs, particles, the word ‘to’ and verbs. It should be noted that adjectives and adverbs are split into three separate categories of normal, comparative and superlative, and that nouns can be singular or plural. Verbs can also be base, past tense, gerunds, past participles, third and non-third person singular presents. Finally, we counted the number of full stops in each text and divided the number of words by this figure in order to reach an average sentence length. This was the first variable to be quantified. Before we look at it, though, we should note that quantifying sentence length presents us with a unique difficulty related to Beckett’s 1964 novel ‘How It Is’, published initially as ‘Comment C’est’ in 1961.
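The percentages reported for each grammatical category further down can be sketched as follows. This is a minimal illustration rather than the script actually used: it assumes the tagger returns a list of (word, tag) pairs, as NLTK's `nltk.pos_tag` does, and the function name `tag_percentages` and the hand-tagged sample are hypothetical. Whether punctuation tokens were counted in the denominator is not stated above, so the sketch simply counts every token it is given.

```python
from collections import Counter


def tag_percentages(tagged_tokens):
    """Return each tag's share of all tokens as a percentage.

    tagged_tokens is assumed to be a list of (word, tag) pairs.
    """
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * n / total for tag, n in counts.items()}


# Hand-tagged toy sentence standing in for real tagger output.
sample = [("She", "PRP"), ("walked", "VBD"), ("to", "TO"), ("the", "DT"),
          ("old", "JJ"), ("house", "NN"), ("slowly", "RB"), (".", ".")]

shares = tag_percentages(sample)
print(shares["PRP"])  # 12.5 (one personal pronoun among eight tokens)
```

Applied to a whole novel, the resulting dictionary would give one row of the kind of table that the analysis below reads in.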

‘How It Is’ contains no punctuation, neither full stops nor commas, so the question of how we might quantify its average sentence length is something of a conundrum. The first impulse might be to count all 36347 of its words as one big sentence, but this would of course make it a major outlier, render the boxplot useless and seriously skew the data. Plotting it as zero would be less detrimental to the analysis but also inaccurate: how could we say that the average sentence length is zero words? So we exclude ‘How It Is’ as a major outlier, and also Anne Brontë’s novel ‘Agnes Grey’, as that is the highest realism outlier for sentence length. This again is not ideal, but the analysis won’t work otherwise.
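The difficulty can be made concrete with a small sketch of the words-per-full-stop calculation. It is an illustration under stated assumptions, not the original script: the function name `average_sentence_length` is hypothetical, a "word" is taken to be any token containing a letter or digit, and a text with no full stops returns `None` rather than zero or one enormous sentence.

```python
def average_sentence_length(tokens):
    """Words per sentence, treating each full stop as one sentence end.

    Returns None when the text contains no full stops, since the
    average is then undefined.
    """
    full_stops = sum(1 for t in tokens if t == ".")
    words = sum(1 for t in tokens if any(ch.isalnum() for ch in t))
    if full_stops == 0:
        return None
    return words / full_stops


punctuated = "the cat sat . the dog ran away .".split()
unpunctuated = "on and on with no stop of any kind".split()

print(average_sentence_length(punctuated))    # 3.5
print(average_sentence_length(unpunctuated))  # None
```

Returning `None` makes the exclusion explicit: the text has to be dropped or handled separately before any summary statistic is computed.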

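# Load the per-genre tables and the combined table (one row per text)
realism <- read.csv("realism.csv", stringsAsFactors = FALSE)
modernism <- read.csv("modernism.csv", stringsAsFactors = FALSE)

centuries <- read.csv("centuries.csv", stringsAsFactors = FALSE)

# 'How It Is' has no full stops: overwrite its entry so the column converts to numeric
centuries$Slength[322] <- 0
centuries$Slength <- as.numeric(centuries$Slength)
attach(centuries)

modernism$Slength[72] <- 0

# Drop 'How It Is' (row 72) from the modernist sentence lengths
modernismsentencelengths <- modernism$Slength[-72]
modernismsentencelengths <- as.numeric(modernismsentencelengths)

# Drop 'Agnes Grey' (row 19) from the realist sentence lengths
realismsentencelengths <- realism$Slength[-19]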

hist(realismsentencelengths, main = "Sentence Length in Realist Novels", xlab = "Sentence Length (Words)")

hist(modernismsentencelengths, main = "Sentence Length in Modernist Novels", xlab = "Sentence Length (Words)")

var(realismsentencelengths)
## [1] 14.29844
var(modernismsentencelengths)
## [1] 37.06567
var.test(realismsentencelengths, modernismsentencelengths)
## 
##  F test to compare two variances
## 
## data:  realismsentencelengths and modernismsentencelengths
## F = 0.38576, num df = 248, denom df = 248, p-value = 1.866e-13
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3005835 0.4950720
## sample estimates:
## ratio of variances 
##          0.3857596
wilcox.test(realismsentencelengths, modernismsentencelengths)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realismsentencelengths and modernismsentencelengths
## W = 45263, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realismsentencelengths, modernismsentencelengths, notch = TRUE, main = "Sentence Length in Realist versus Modernist Novels", names = c("Realism", "Modernism"), ylab = "Sentence Length (Words)", xlab = "Genre")

median(realismsentencelengths)
## [1] 22.30144
median(modernismsentencelengths)
## [1] 18.44173
median(realismsentencelengths) - median(modernismsentencelengths)
## [1] 3.859712

The variances are significantly different from one another (F = 0.386, p = 1.866e-13), and while the realist data is fairly evenly distributed, the modernist data is highly peaked. It is therefore necessary to compare the two groups using a non-parametric test, which does not make assumptions about the way the data is distributed, and this is why the Mann-Whitney test was used. The p-value is less than 2.2e-16, demonstrating that the decline in median sentence length from 22.3 words to 18.4 is statistically significant.
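For readers unfamiliar with the statistic, the W value that R prints can be read as a count of pairwise comparisons. The sketch below is an illustrative brute-force version, not R's implementation: the function name `rank_sum_w` and the toy data are hypothetical, and it computes only the statistic, not the p-value.

```python
def rank_sum_w(x, y):
    """Count the (x, y) pairs in which the x value is larger.

    Ties count as half a pair. This matches the W statistic reported
    by R's wilcox.test for two independent samples.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


# Toy sentence lengths, invented purely for illustration.
realist_toy = [22, 25, 19]
modernist_toy = [18, 19, 30]
print(rank_sum_w(realist_toy, modernist_toy))  # 5.5

# Reading the reported sentence-length result the same way:
print(round(45263 / (249 * 249), 2))  # 0.73
```

On this reading, W = 45263 out of 249 × 249 possible pairs suggests that in roughly 73% of realist-modernist pairings the realist text has the longer average sentence.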

When we look at this first boxplot, which displays the difference in variation in sentence length between realist and modernist literature, we can see that there are two higher realism outliers for sentence length. These are William Makepeace Thackeray’s novels ‘The Luck of Barry Lyndon’ (1844) and ‘Rebecca and Rowena’ (1850), at 33.1 and 32.2 words per sentence respectively. There is also one lower realism outlier, Stephen Crane’s 1896 novel ‘George’s Mother’, which has an average of 12.1 words per sentence.

There are a large number of modernist outliers for sentence length, sixteen in all: William Faulkner’s ‘Absalom, Absalom!’ (1936) at 46.4 and ‘Intruder in the Dust’ (1948) at 42.3; Marcel Proust’s ‘Swann’s Way’ (1913) at 42.9, ‘In a Budding Grove’ (1919) at 40.2, ‘Time Re-gained’ (1927) at 38, ‘The Prisoner’ (1923) at 37.2, ‘The Fugitive’ (1925) at 35.7, ‘The Guermantes Way’ (1921) at 34.1 and ‘Sodom and Gomorrah’ (1922) at 30.9; Samuel Beckett’s ‘Texts for Nothing’ (1955) at 40.5 and ‘The Unnamable’ (1955) at 32.9; Gertrude Stein’s ‘The Making of Americans’ (1925) at 33.9 and ‘Everybody’s Autobiography’ (1937) at 33.5; Henry James’ unfinished novel ‘The Ivory Tower’, published posthumously in 1917, at 31.8 and his 1910 short stories at 35.2; and Ford Madox Ford’s novel ‘The Young Lovell’ (1913) at 29.

There are also two Beckett texts which are low outliers for sentence length: ‘Worstward Ho’ (1986) at 4.9 and ‘Ill Seen Ill Said’ (1981) at 7.
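The outliers named here are the points that a boxplot draws beyond its whiskers. A rough sketch of the usual rule (anything more than 1.5 times the interquartile range beyond the quartiles) follows. It is an approximation under stated assumptions: the function name `tukey_outliers` and the sample values are hypothetical, and the quartiles come from Python's `statistics.quantiles`, which can differ slightly from the hinges R uses, so borderline texts might be classified differently.

```python
from statistics import quantiles


def tukey_outliers(values, coef=1.5):
    """Split out values lying more than coef * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - coef * iqr
    high_fence = q3 + coef * iqr
    low = [v for v in values if v < low_fence]
    high = [v for v in values if v > high_fence]
    return low, high


# Invented sentence lengths, not values taken from the corpus tables.
lengths = [12.1, 18.0, 19.5, 20.2, 21.0, 22.3, 23.1, 24.0, 25.5, 33.1]
low, high = tukey_outliers(lengths)
print(low, high)  # [12.1] [33.1]
```

Because the fences depend on the spread of each group, the same sentence length can be an outlier in one corpus and unremarkable in the other.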

hist(realism$Personal.Pronoun, main = "Personal Pronouns in Realist Novels", xlab = "Personal Pronouns in Realist Novels (%)")

hist(modernism$Personal.Pronoun, main = "Personal Pronouns in Modernist Novels", xlab = "Personal Pronouns in Modernist Novels (%)")

var(realism$Personal.Pronoun)
## [1] 0.7873269
var(modernism$Personal.Pronoun)
## [1] 1.261971
var.test(realism$Personal.Pronoun, modernism$Personal.Pronoun)
## 
##  F test to compare two variances
## 
## data:  realism$Personal.Pronoun and modernism$Personal.Pronoun
## F = 0.62389, num df = 249, denom df = 249, p-value = 0.0002143
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4863769 0.8002738
## sample estimates:
## ratio of variances 
##          0.6238868
wilcox.test(realism$Personal.Pronoun, modernism$Personal.Pronoun)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Personal.Pronoun and modernism$Personal.Pronoun
## W = 15117, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Personal.Pronoun, modernism$Personal.Pronoun, notch = TRUE, main = "Personal Pronouns in Realist versus Modernist Novels", ylab = "Personal Pronouns (%)", xlab = "Genre", names = c("Realism", "Modernism"))

median(realism$Personal.Pronoun)
## [1] 5.364828
median(modernism$Personal.Pronoun)
## [1] 6.436688
median(realism$Personal.Pronoun) - median(modernism$Personal.Pronoun)
## [1] -1.071859

The realist data is fairly evenly distributed, but the modernist data is highly peaked. The variances are also significantly different, exhibiting an increase of 0.5, from 0.79 to 1.26. Therefore, as above, we must make use of the non-parametric Mann-Whitney test, as it does not make assumptions about the ways in which the data is distributed.
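The F statistic in the variance test is simply the ratio of the two sample variances, which can be checked by hand. The sketch below is illustrative only: the function name `variance_ratio` and the toy lists are hypothetical, and it does not compute the p-value.

```python
from statistics import variance


def variance_ratio(x, y):
    """Ratio of the sample variance of x to the sample variance of y."""
    return variance(x) / variance(y)


# Toy data: y is x doubled, so its variance is four times larger
# and the ratio comes out at about one quarter.
print(variance_ratio([1, 2, 3, 4], [2, 4, 6, 8]))

# Checking the reported personal pronoun result from the printed variances.
reported_f = 0.7873269 / 1.261971
print(round(reported_f, 5))  # 0.62389
```

A ratio well below 1 indicates that the first group (here realism) is the less variable of the two.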

The result of the test is that personal pronouns increase by roughly one percentage point, from a median of 5.4% in realism to 6.4% in modernism, and that this increase is significant.

There are three higher realism outliers: Fyodor Dostoevsky’s short story ‘White Nights’, published in 1848 (7.7%), his 1861 novel ‘The Insulted and Humiliated’ (8.5%) and Émile Zola’s novel ‘A Dead Woman’s Wish’ (1890) (8%).

There are three lower modernism outliers: Beckett’s Stories (3%) and ‘Worstward Ho’ (1.2%), and Stein’s 1914 novel ‘Tender Buttons’ (3%).

hist(realism$Verb.past.tense, main = "Past Tense Verbs in Realist Novels", xlab = "Past Tense Verbs in Realist Novels (%)")

hist(modernism$Verb.past.tense, main = "Past Tense Verbs in Modernist Novels", xlab = "Past Tense Verbs in Modernist Novels (%)")

var(realism$Verb.past.tense)
## [1] 1.160141
var(modernism$Verb.past.tense)
## [1] 1.691106
var.test(realism$Verb.past.tense, modernism$Verb.past.tense)
## 
##  F test to compare two variances
## 
## data:  realism$Verb.past.tense and modernism$Verb.past.tense
## F = 0.68602, num df = 249, denom df = 249, p-value = 0.003061
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5348193 0.8799797
## sample estimates:
## ratio of variances 
##          0.6860249
wilcox.test(realism$Verb.past.tense, modernism$Verb.past.tense)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Verb.past.tense and modernism$Verb.past.tense
## W = 17418, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Verb.past.tense, modernism$Verb.past.tense, notch = TRUE, main = "Past Tense Verbs in Realist versus Modernist Novels", xlab = "Genre", ylab = "Past Tense Verbs (%)", names = c("Realism", "Modernism"))

median(realism$Verb.past.tense)
## [1] 5.566386
median(modernism$Verb.past.tense)
## [1] 6.52244
median(realism$Verb.past.tense) - median(modernism$Verb.past.tense)
## [1] -0.9560534

The realist past tense verb data is non-normally distributed and the modernist data is skewed to the right. The variance also differs significantly, increasing from 1.16 to 1.69 (p = 0.003). This, as before, indicates that we must use the non-parametric Mann-Whitney test in assessing the difference in medians between the two genres.

The Mann-Whitney test informs us that there is a significant increase in past tense verbs from realism to modernism, from a median of 5.6% to 6.5%.

The lower realism outlier is Balzac’s 1841 novel ‘Letters of Two Brides’ (2.7%), and one higher modernism outlier is Virginia Woolf’s 1937 novel ‘The Years’ (10%).

There are, despite the overall increase, a large number of lower modernist outliers, including James Joyce’s 1922 novel ‘Ulysses’ (4.3%) and 1939 novel ‘Finnegans Wake’ (2.7%); William Faulkner’s 1930 novel ‘As I Lay Dying’ (4.2%) and ‘Requiem for a Nun’ (1951) (3.6%); Samuel Beckett’s ‘Malone Dies’ (1951) (3.9%), ‘Fizzles’ (1976) (2.5%), ‘Company’ (1979) (2%), ‘Texts for Nothing’ (1.8%), ‘The Unnamable’ (1.7%), ‘Worstward Ho’ (1.6%), ‘Ill Seen Ill Said’ (1.4%), Stories (2.2%) and ‘How It Is’ (2.1%); Conrad and Ford Madox Ford’s collaborative novel ‘The Nature of a Crime’ (1909) (2.6%); Woolf’s ‘The Waves’ (1931) (2.4%); and Stein’s ‘Tender Buttons’ (1.7%).

hist(realism$Adverb, main = "Distribution of Adverbs in Realist Novels", xlab = "Realist Adverbs (%)")

hist(modernism$Adverb, main = "Distribution of Adverbs in Modernist Novels", xlab = "Modernist Adverbs (%)")

var(realism$Adverb)
## [1] 0.6905353
var(modernism$Adverb)
## [1] 1.263961
var.test(realism$Adverb, modernism$Adverb)
## 
##  F test to compare two variances
## 
## data:  realism$Adverb and modernism$Adverb
## F = 0.54633, num df = 249, denom df = 249, p-value = 2.252e-06
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4259115 0.7007853
## sample estimates:
## ratio of variances 
##          0.5463264
wilcox.test(realism$Adverb, modernism$Adverb)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Adverb and modernism$Adverb
## W = 17881, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
median(realism$Adverb)
## [1] 4.470327
median(modernism$Adverb)
## [1] 5.06117
median(realism$Adverb) - median(modernism$Adverb)
## [1] -0.5908429
#p-value < 2.2e-16, significant 0.6% increase in adverbs from 4.5% to 5.1%

boxplot(realism$Adverb, modernism$Adverb, notch = TRUE, main = "Adverbs in Realist Versus Modernist Novels", ylab = "Adverbs (%)", xlab = "Genre", names = c("Realism", "Modernism"))

The realist data is non-normally distributed, and is noteworthy for its skewing left, and the modernist data is highly peaked. The variances are also significantly different (F = 0.546, p = 2.252e-06), so we require non-parametric measurements in order to assess the differences between the two.

There are a large number of higher outliers amongst the modernist data, including Beckett’s Short Fiction (8.3%), ‘Stirrings Still’ (12.5%), ‘Worstward Ho’ (12.2%), ‘Fizzles’ (8.9%), ‘Ill Seen Ill Said’ (8%), ‘How It Is’ (7.6%) and ‘Company’ (7.1%); Stein’s ‘Three Lives’ (10%), ‘The Making of Americans’ (8.8%), ‘Everybody’s Autobiography’ (8.8%) and ‘The Autobiography of Alice B. Toklas’ (7.7%); Faulkner’s ‘Intruder in the Dust’ (7.8%), ‘The Mansion’ (7.2%), ‘A Fable’ (7.1%) and ‘Absalom, Absalom!’ (7%); and Kafka’s Stories (7.3%).

hist(realism$Preposition, main = "Distribution of Prepositions in Realist Novels", xlab = "Prepositions (%)")

hist(modernism$Preposition, main = "Distribution of Prepositions in Modernist Novels", xlab = "Prepositions (%)")

var(realism$Preposition)
## [1] 0.5668526
var(modernism$Preposition)
## [1] 1.466232
var.test(realism$Preposition, modernism$Preposition)
## 
##  F test to compare two variances
## 
## data:  realism$Preposition and modernism$Preposition
## F = 0.38661, num df = 249, denom df = 249, p-value = 1.887e-13
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3013941 0.4959071
## sample estimates:
## ratio of variances 
##           0.386605
median(realism$Preposition)
## [1] 10.89971
median(modernism$Preposition)
## [1] 10.46074
median(realism$Preposition) - median(modernism$Preposition)
## [1] 0.4389673
wilcox.test(realism$Preposition, modernism$Preposition)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Preposition and modernism$Preposition
## W = 37345, p-value = 0.0001614
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Preposition, modernism$Preposition, notch = TRUE, main = "Percentage of Prepositions in Realist versus Modernist Novels", xlab = "Genre", ylab = "Prepositions (%)", names = c("Realism", "Modernism"))

The realist data is skewed slightly to the right, but not totally abnormal. However, the modernist data is highly peaked, and the variances are significantly different (F = 0.387, p = 1.887e-13), so it is preferable to use non-parametric measurements to assess the difference between the medians and to avoid making assumptions about the ways in which the data is distributed. The Mann-Whitney test reveals that there is a decrease of 0.4 percentage points in prepositions, from 10.9% to 10.5%, and that this decrease is significant.

There are two lower realism outliers, Crane’s novel ‘The Third Violet’ (8.3%) and Thackeray’s ‘Lovel the Widower’ (8.7%).

There is one higher modernism outlier, Beckett’s ‘Stirrings Still’ (14%), and one lower modernism outlier, ‘Tender Buttons’ (7.1%).

hist(realism$wh.determiner, main = "Distribution of wh Determiners in Realist Novels", xlab = "Wh Determiners (%)")

hist(modernism$wh.determiner, main = "Distribution of wh Determiners in Modernist Novels", xlab = "Wh Determiners (%)")

var(realism$wh.determiner)
## [1] 0.03535094
var(modernism$wh.determiner)
## [1] 0.04009686
var.test(realism$wh.determiner, modernism$wh.determiner)
## 
##  F test to compare two variances
## 
## data:  realism$wh.determiner and modernism$wh.determiner
## F = 0.88164, num df = 249, denom df = 249, p-value = 0.3209
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6873181 1.1308980
## sample estimates:
## ratio of variances 
##          0.8816386
wilcox.test(realism$wh.determiner, modernism$wh.determiner)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$wh.determiner and modernism$wh.determiner
## W = 48358, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$wh.determiner, modernism$wh.determiner, notch = TRUE, main = "Distribution of Wh Determiners in Realist versus Modernist Novels", xlab = "Genre", ylab = "Wh Determiners (%)", names = c("Realism", "Modernism"))

median(realism$wh.determiner)
## [1] 0.5749586
median(modernism$wh.determiner)
## [1] 0.390283
median(realism$wh.determiner) - median(modernism$wh.determiner)
## [1] 0.1846756

The realism data is highly peaked and the modernist data is skewed left. While the variation in wh determiners increases from realism to modernism, the increase is not significant. All the same, due to the unevenness in the distribution revealed in the histograms, we must use a non-parametric analysis in order to assess the significance of the difference in medians.

The Mann-Whitney test reveals that the decrease in the median from 0.57% to 0.39% is significant, with a p-value below 2.2e-16.

There are two high realism outliers, Balzac’s novels ‘Seraphita’ (1834) and ‘Scenes from Parisian Life’.

There are ten high modernism outliers, including James’ ‘The Ivory Tower’ (0.7%) and ‘The Sense of the Past’ (0.9%), Stein’s ‘Tender Buttons’ (1%), Faulkner’s ‘Absalom, Absalom!’ (1%), and Proust’s ‘Sodom and Gomorrah’ (1.1%), ‘Time Re-gained’ (1.2%), ‘The Guermantes Way’ (1.2%), ‘The Prisoner’ (1.3%), ‘Swann’s Way’ (1.4%), ‘The Fugitive’ (1.5%) and ‘In a Budding Grove’ (1.5%).

hist(realism$Particle, main = "Distribution of Particles in Realist Novels", xlab = "Particles (%)")

hist(modernism$Particle, main = "Distribution of Particles in Modernist Novels", xlab = "Particles (%)")

var(realism$Particle)
## [1] 0.0172133
var(modernism$Particle)
## [1] 0.02025903
var.test(realism$Particle, modernism$Particle)
## 
##  F test to compare two variances
## 
## data:  realism$Particle and modernism$Particle
## F = 0.84966, num df = 249, denom df = 249, p-value = 0.1993
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6623882 1.0898788
## sample estimates:
## ratio of variances 
##          0.8496604
wilcox.test(realism$Particle, modernism$Particle)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Particle and modernism$Particle
## W = 17119, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Particle, modernism$Particle, notch = TRUE, main = "Particles in Realist versus Modernist Novels", xlab = "Genre", ylab = "Particles (%)", names = c("Realism", "Modernism"))

median(realism$Particle)
## [1] 0.3975158
median(modernism$Particle)
## [1] 0.5128835
median(realism$Particle) - median(modernism$Particle)
## [1] -0.1153677

The realism data is highly peaked and also skewed left, whereas the modernist data is highly peaked. Though there is an increase in the variation from realism to modernism, the increase is not significant. Because neither distribution is normal, however, we make use of the Mann-Whitney test as opposed to Student’s t-test.

The test shows that the increase of 0.1 percentage points in particles, from 0.4% to 0.5%, is significant.

There is one high modernism outlier, Ernest Hemingway’s ‘The Sun Also Rises’ (1926) (0.9%), and one high realism outlier, Émile Zola’s ‘The Attack on the Mill’ (1%).

hist(realism$Verb.non.3rd.person.singular.present, main = "Non third-person singular present verbs in realist novels", xlab = "Non third-person singular present verbs (%)")

hist(modernism$Verb.non.3rd.person.singular.present, main = "Non third-person singular present verbs in modernist novels", xlab = "Non third-person singular present verbs (%)")

var(realism$Verb.non.3rd.person.singular.present)
## [1] 0.226187
var(modernism$Verb.non.3rd.person.singular.present)
## [1] 0.2548945
var.test(realism$Verb.non.3rd.person.singular.present, modernism$Verb.non.3rd.person.singular.present)
## 
##  F test to compare two variances
## 
## data:  realism$Verb.non.3rd.person.singular.present and modernism$Verb.non.3rd.person.singular.present
## F = 0.88737, num df = 249, denom df = 249, p-value = 0.3464
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6917901 1.1382561
## sample estimates:
## ratio of variances 
##          0.8873749
wilcox.test(realism$Verb.non.3rd.person.singular.present, modernism$Verb.non.3rd.person.singular.present)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Verb.non.3rd.person.singular.present and modernism$Verb.non.3rd.person.singular.present
## W = 34747, p-value = 0.03042
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Verb.non.3rd.person.singular.present, modernism$Verb.non.3rd.person.singular.present, notch = TRUE, main = "Non Third Person Singular Present Verbs in Realism Versus Modernism", xlab = "Genre", ylab = "Non Third Person Singular Present Verbs (%)", names = c("Realism", "Modernism"))

median(realism$Verb.non.3rd.person.singular.present)
## [1] 1.581927
median(modernism$Verb.non.3rd.person.singular.present)
## [1] 1.458354
median(realism$Verb.non.3rd.person.singular.present) - median(modernism$Verb.non.3rd.person.singular.present)
## [1] 0.1235726

The realist and modernist data are both skewed left, and though variation increases, it is a non-significant increase. Because the data is skewed, we use a non-parametric test, the Mann-Whitney.

The result is that the decrease of 0.12 percentage points in non third-person singular present verbs, from a median of 1.58% to 1.46%, is significant (p = 0.03).

There are four high realism outliers, Dostoevsky’s ‘White Nights’ (3.3%), ‘Poor Folk’ (3%), ‘Uncle’s Dream’ (3%) and ‘The Village of Stepanchikovo’ (3%), and seven high modernism outliers: Woolf’s ‘The Waves’ (3.8%), Ford and Conrad’s ‘The Nature of a Crime’ (3.8%), Beckett’s ‘The Unnamable’ (3.6%), ‘How It Is’ (3.2%) and ‘Texts for Nothing’ (3%), Hemingway’s ‘Across the River and into the Trees’ (3%) and Faulkner’s ‘The Sound and the Fury’ (2.7%).

hist(realism$Existential, main = "Existentials in Realist Novels", xlab = "Existentials (%)")

hist(modernism$Existential, main = "Existentials in Modernist Novels", xlab = "Existentials (%)")

var(realism$Existential)
## [1] 0.003356196
var(modernism$Existential)
## [1] 0.01716611
var.test(realism$Existential, modernism$Existential)
## 
##  F test to compare two variances
## 
## data:  realism$Existential and modernism$Existential
## F = 0.19551, num df = 249, denom df = 249, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1524202 0.2507888
## sample estimates:
## ratio of variances 
##          0.1955129
wilcox.test(realism$Existential, modernism$Existential)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Existential and modernism$Existential
## W = 17183, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Existential, modernism$Existential, notch = TRUE, xlab = "Genre", ylab = "Existentials (%)", names = c("Realism", "Modernism"))

median(realism$Existential)
## [1] 0.1655106
median(modernism$Existential)
## [1] 0.2081313
median(realism$Existential) - median(modernism$Existential)
## [1] -0.0426207

Both the realism and the modernism data are non-normally distributed: the former is highly peaked and the latter is highly skewed towards the left, or towards the lower values. The variation also increases, and this increase is significant (F = 0.196, p < 2.2e-16).
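The F statistic reported by `var.test` is the ratio of the two sample variances, so it can be checked by hand against the figures printed above. The sketch below simply re-enters those two printed variances; the variable names are illustrative.

```r
# Sample variances as printed above for the existential tag
var_realism   <- 0.003356196
var_modernism <- 0.01716611

# var.test(x, y) reports var(x) / var(y) as its F statistic
f_ratio <- var_realism / var_modernism
round(f_ratio, 4)

# A ratio below 1 means the second sample (modernism) is the more variable;
# inverting it expresses the same comparison as a multiple
var_modernism / var_realism
```

Read this way, a ratio of roughly 0.2 says that the modernist variance is about five times the realist one.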

As the distribution is non-normal across both data sets, the Mann-Whitney test was used to assess the extent of the changes.

The test found the increase in existentials to be significant: the mean rises by 0.06%, from 0.17% to 0.23%.

There are three higher realism outliers, Crane’s ‘The Little Regiment’ (0.35%) and ‘The Red Badge of Courage’ (0.34%), and ‘His Masterpiece’, a novel by Zola (0.31%).

There are seven higher modernism outliers, Hemingway’s ‘For Whom the Bell Tolls’ (0.4%), Beckett’s ‘The Unnamable’ (0.4%) and ‘Texts for Nothing’ (0.6%) and Stein’s novels ‘The Making of Americans’ (0.6%), ‘The Autobiography of Alice B. Toklas’ (0.7%), ‘Everybody’s Autobiography’ (0.8%) and ‘Tender Buttons’ (2%).

hist(realism$Possessive.wh.pronoun, main = "Possessive Wh Pronouns in Realist Novels", xlab = "Wh Pronouns (%)")

hist(modernism$Possessive.wh.pronoun, main = "Possessive Wh Pronouns in Modernist Novels", xlab = "Wh Pronouns (%)")

var(realism$Possessive.wh.pronoun)
## [1] 0.0006894213
var(modernism$Possessive.wh.pronoun)
## [1] 0.0002189903
var.test(realism$Possessive.wh.pronoun, modernism$Possessive.wh.pronoun)
## 
##  F test to compare two variances
## 
## data:  realism$Possessive.wh.pronoun and modernism$Possessive.wh.pronoun
## F = 3.1482, num df = 249, denom df = 249, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  2.454297 4.038245
## sample estimates:
## ratio of variances 
##           3.148182
wilcox.test(realism$Possessive.wh.pronoun, modernism$Possessive.wh.pronoun)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Possessive.wh.pronoun and modernism$Possessive.wh.pronoun
## W = 47304, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Possessive.wh.pronoun, modernism$Possessive.wh.pronoun, main = "Possessive Wh Pronouns in Realist Versus Modernist Novels", xlab = "Genre", names = c("Realism", "Modernism"), notch = TRUE)

median(realism$Possessive.wh.pronoun)
## [1] 0.03271143
median(modernism$Possessive.wh.pronoun)
## [1] 0.01770989
median(realism$Possessive.wh.pronoun) - median(modernism$Possessive.wh.pronoun)
## [1] 0.01500155

Both realist and modernist data skew left, and variation decreases significantly. As a result, we use the Mann-Whitney test to assess the significance of the difference, and find that the 0.02% decrease, from 0.04% to 0.02%, is significant.

There are three high realism outliers, Balzac’s ‘Sarrasine’, ‘Seraphita’ and ‘Scenes from Parisian Life’.

There are a large number of high modernism outliers, including Beckett’s ‘Worstward Ho’ (0.09%), ‘Watt’ (0.06%), ‘Texts for Nothing’ (0.05%) and ‘First Love’ (0.05%); Proust’s ‘Swann’s Way’ (0.09%), ‘Time Regained’ (0.07%), ‘In a Budding Grove’ (0.07%), ‘The Guermantes Way’ (0.06%), ‘Sodom and Gomorrah’ (0.06%), ‘The Prisoner’ (0.05%) and ‘The Fugitive’ (0.05%); Borges’ ‘Labyrinths’ (0.08%); Conrad’s ‘The Secret Agent’ (0.06%); Edith Wharton’s ‘The Valley of Decision’ (0.06%); and Faulkner’s ‘Absalom, Absalom!’ (0.05%).

hist(realism$Adjective.Superlative, main = "Superlative Adjectives in Realist Novels", xlab = "Superlative Adjectives (%)")

hist(modernism$Adjective.Superlative, main = "Superlative Adjectives in Modernist Novels", xlab = "Superlative Adjectives (%)")

var(realism$Adjective.Superlative)
## [1] 0.002045467
var(modernism$Adjective.Superlative)
## [1] 0.003396895
var.test(realism$Adjective.Superlative, modernism$Adjective.Superlative)
## 
##  F test to compare two variances
## 
## data:  realism$Adjective.Superlative and modernism$Adjective.Superlative
## F = 0.60216, num df = 249, denom df = 249, p-value = 6.982e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4694373 0.7724017
## sample estimates:
## ratio of variances 
##          0.6021579
wilcox.test(realism$Adjective.Superlative, modernism$Adjective.Superlative)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  realism$Adjective.Superlative and modernism$Adjective.Superlative
## W = 35867, p-value = 0.004265
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Adjective.Superlative, modernism$Adjective.Superlative, notch = TRUE, main = "Superlative Adjectives in Realist Versus Modernist Novels", xlab = "Genre", ylab = "Adjective Superlatives (%)", names = c("Realism", "Modernism"))

median(realism$Adjective.Superlative)
## [1] 0.1359699
median(modernism$Adjective.Superlative)
## [1] 0.1282673
median(realism$Adjective.Superlative) - median(modernism$Adjective.Superlative)
## [1] 0.007702564

Neither the realism nor the modernism data is normally distributed: the realism data is highly peaked and the modernism data is highly skewed to the left. The variation increases significantly (F = 0.60, p < 0.001), and because the data is non-normal we use the Mann-Whitney test to assess the significance of the difference. The 0.01% decrease, from 0.14% to 0.13%, is found to be significant.

There are three higher realism outliers, ‘An Episode Under the Terror’ (0.3%) and ‘The Ball at Sceaux’ (0.3%), both texts by Balzac, and Jane Austen’s ‘Mansfield Park’ (0.3%).

There are seven higher modernism outliers, Beckett’s ‘Worstward Ho’ (0.6%) and his stories (0.3%), Henry James’ ‘The Outcry’ (0.3%), ‘The Wings of the Dove’ (0.2%), ‘The Sense of the Past’ (0.4%) and ‘The Ivory Tower’ (0.4%), and William Faulkner’s ‘The Mansion’ (0.3%).

As we have seen thus far, boxplots are very useful visualisations for the identification of outliers, which are data points lying more than 1.5 times the interquartile range above the third quartile, or more than 1.5 times the interquartile range below the first quartile. One thing that we will have noticed in comparing the realist and modernist corpora is not just that modernism seems to increase the variation for each observation (this takes place in all but one instance, and to a degree that is statistically significant in three of these), but that modernism also generates far more statistical outliers, and that in instances where the variation increases significantly, the same small group of authors tends to be responsible.
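The outlier rule described above can be sketched directly in code. The function name `find_outliers` and the toy vector are hypothetical. Note that `boxplot` itself computes its box from Tukey's hinges rather than from `quantile`, so on real data the two may occasionally disagree about a borderline point.

```r
# Flag values lying more than 1.5 * IQR beyond the first or third quartile
find_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  list(
    low  = which(x < q[1] - 1.5 * iqr),   # positions of low outliers
    high = which(x > q[2] + 1.5 * iqr)    # positions of high outliers
  )
}

# Toy data (assumption): nine typical values and one extreme one
x <- c(1.2, 1.4, 1.5, 1.5, 1.6, 1.6, 1.7, 1.8, 1.9, 3.8)
out <- find_outliers(x)
x[out$high]
```

Returning positions rather than values means the result can be used to look up the corresponding titles in the corpus table.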

It’s tempting, I think, in seeking explanations for what might be happening here, to reach for established explanations from within the literature, such as the phenomenon of the avant-garde, a term more frequently used to delineate a number of sects of praxis-oriented, confrontational and improvisatory artists, usually working in opposition to more established modernists. As Douglas Mao and Rebecca L. Walkowitz note, ‘By the end of the century… “modernism” could be used in a way that “avant-garde” could not: to suggest a persistent orthodoxy rather than a deliberate challenge’ (6). The presumption of the avant-garde is that modernism, as represented by T.S. Eliot and the New Critics, with its institutional affiliations and the support of a university culture, was far more contiguous with tradition than not.

Gilbert and Gubar have conceptualised female writers as ‘the avant-garde of the avant-garde’, on the basis of ‘their problematic relationship to the tradition of authority’ (Mao and Walkowitz: 8).

‘The manifesto violated almost all values associated with high modernism by privileging the explicit rather than the implicit, easy slogans rather than complex structures, shrill raptures rather than subtle epiphanies, direct address of the audience rather than unreliable narratives, political interventionism rather than aesthetic autonomy, and finally revolution rather than transcendence’ (Puchner 108). The manifesto was more of a piece with the age of media remediation: newspapers, advertising, propaganda.

The degree to which this disjunction is emphasised varies within the literature. Some insist that the difference is absolute, and that we should not refer to any modernist novelists or even poets as avant-garde in the same sense as the Dadaists or the Surrealists. Nonetheless, I think it’s worth quantifying the extent of these outlier values to see what we might come up with.

anepisodeundertheterrorH <- realism$Adjective.Superlative[221] - as.numeric(summary(realism$Adjective.Superlative)[5])
anepisodeundertheterror <- anepisodeundertheterrorH * anepisodeundertheterrorH
theattackonthemillH <- max(realism$Particle) - as.numeric(summary(realism$Particle)[5])
theattackonthemill <- theattackonthemillH * theattackonthemillH
theballatsceauxH <- realism$Adjective.Superlative[166] - as.numeric(summary(realism$Adjective.Superlative)[5])
theballatsceaux <- theballatsceauxH * theballatsceauxH
thedeadwomanswishH <- realism$Personal.Pronoun[21] - as.numeric(summary(realism$Personal.Pronoun)[5])
thedeadwomanswish <- thedeadwomanswishH * thedeadwomanswishH
hismasterpieceH <- realism$Existential[36] - as.numeric(summary(realism$Existential)[5])
hismasterpiece <- hismasterpieceH * hismasterpieceH
theinsultedandhumiliatedH <- realism$Personal.Pronoun[112] - as.numeric(summary(realism$Personal.Pronoun)[5])
theinsultedandhumiliated <- theinsultedandhumiliatedH * theinsultedandhumiliatedH
thelittleregimentH <- max(realism$Existential) - as.numeric(summary(realism$Existential)[5])
thelittleregiment <- thelittleregimentH * thelittleregimentH
lettersoftwobridesL <- realism$Verb.past.tense[167] - as.numeric(summary(realism$Verb.past.tense)[2])
lettersoftwobrides <- lettersoftwobridesL * lettersoftwobridesL
lovelthewidowerL <- realism$Preposition[135] - as.numeric(summary(realism$Preposition)[5])
lovelthewidower <- lovelthewidowerL * lovelthewidowerL
mansfieldparkH <- realism$Adjective.Superlative[61] - as.numeric(summary(realism$Adjective.Superlative)[5])
mansfieldpark <- mansfieldparkH * mansfieldparkH
poorfolkH <- realism$Verb.non.3rd.person.singular.present[107] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
poorfolk <- poorfolkH * poorfolkH
seraphitaH <- realism$wh.determiner[248] - as.numeric(summary(realism$wh.determiner)[5])
seraphita <- seraphitaH * seraphitaH
theredbadgeofcourageH <- realism$Existential[3] - as.numeric(summary(realism$Existential)[5])
theredbadgeofcourage <- theredbadgeofcourageH * theredbadgeofcourageH
thethirdvioletL <- realism$Preposition[5] - as.numeric(summary(realism$Preposition)[5])
thethirdviolet <- thethirdvioletL * thethirdvioletL
unclesdreamH <- realism$Verb.non.3rd.person.singular.present[100] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
unclesdream <- unclesdreamH * unclesdreamH
villageofstepanchikovoH <- realism$Verb.non.3rd.person.singular.present[111] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
villageofstepanchikovo <- villageofstepanchikovoH * villageofstepanchikovoH
whitenightsH1 <- realism$Personal.Pronoun[124] - as.numeric(summary(realism$Personal.Pronoun)[5])
whitenightsH2 <- realism$Verb.non.3rd.person.singular.present[124] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
whitenights <- (whitenightsH1 * whitenightsH1) + (whitenightsH2 * whitenightsH2)

absalomH1 <- modernism$Adverb[99] - as.numeric(summary(modernism$Adverb)[5])
absalomH2 <- modernism$wh.determiner[99] - as.numeric(summary(modernism$wh.determiner)[5])
absalomH3 <- modernism$Possessive.wh.pronoun[99] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
absalom <- (absalomH2 * absalomH2) + (absalomH3 * absalomH3) + (absalomH1 * absalomH1)
acrosstheriverandintothetreesH <- modernism$Verb.non.3rd.person.singular.present[1] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
acrosstheriverandintothetrees <- acrosstheriverandintothetreesH * acrosstheriverandintothetreesH
asilaydyingL <- modernism$Verb.past.tense[100] - as.numeric(summary(modernism$Verb.past.tense)[2])
asilaydying <- (asilaydyingL * asilaydyingL)
theautobiographyofalicebtoklasH2 <- modernism$Existential[66] - as.numeric(summary(modernism$Existential)[5])
theautobiographyofalicebtoklasH1 <- modernism$Adverb[66] - as.numeric(summary(modernism$Adverb)[5])
theautobiographyofalicebtoklas <- (theautobiographyofalicebtoklasH2 * theautobiographyofalicebtoklasH2) + (theautobiographyofalicebtoklasH1 * theautobiographyofalicebtoklasH1)
beckettH1 <- modernism$Adjective.Superlative[80] - as.numeric(summary(modernism$Adjective.Superlative)[5])
beckettL1 <- modernism$Personal.Pronoun[80] - as.numeric(summary(modernism$Personal.Pronoun)[2])
beckettL2 <- modernism$Verb.past.tense[80] - as.numeric(summary(modernism$Verb.past.tense)[2])
beckettH <- modernism$Adverb[80] - as.numeric(summary(modernism$Adverb)[5])
beckettstories <- (beckettH1 * beckettH1) + (beckettL1 * beckettL1) + (beckettL2 * beckettL2) + (beckettH * beckettH)
companyL <- modernism$Verb.past.tense[84] - as.numeric(summary(modernism$Verb.past.tense)[2])
companyH <- modernism$Adverb[84] - as.numeric(summary(modernism$Adverb)[5])
company <- (companyL * companyL) + (companyH * companyH)
everybodysautobiographyH2 <- modernism$Existential[63] - as.numeric(summary(modernism$Existential)[5])
everybodysautobiographyH1 <- modernism$Adverb[63] - as.numeric(summary(modernism$Adverb)[5])
everybodysautobiography <- (everybodysautobiographyH2 * everybodysautobiographyH2) + (everybodysautobiographyH1 * everybodysautobiographyH1)
afableH <- modernism$Adverb[101] - as.numeric(summary(modernism$Adverb)[5])
afable <- (afableH * afableH)
finneganswakeL <- modernism$Verb.past.tense[62] - as.numeric(summary(modernism$Verb.past.tense)[2])
finneganswake <- finneganswakeL * finneganswakeL
firstloveH <- modernism$Possessive.wh.pronoun[76] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
firstlove <- firstloveH * firstloveH
fizzlesL <- modernism$Verb.past.tense[79] - as.numeric(summary(modernism$Verb.past.tense)[2])
fizzlesH <- modernism$Adverb[79] - as.numeric(summary(modernism$Adverb)[5])
fizzles <- (fizzlesL * fizzlesL) + (fizzlesH * fizzlesH)
forwhomthebelltollsH <- modernism$Existential[4] - as.numeric(summary(modernism$Existential)[5])
forwhomthebelltolls <- forwhomthebelltollsH * forwhomthebelltollsH
thefugitiveH1 <- modernism$Possessive.wh.pronoun[24] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
thefugitiveH <- modernism$wh.determiner[24] - as.numeric(summary(modernism$wh.determiner)[5])
thefugitive <- (thefugitiveH1 * thefugitiveH1) + (thefugitiveH * thefugitiveH)
theguermanteswayH <- modernism$wh.determiner[21] - as.numeric(summary(modernism$wh.determiner)[5])
theguermanteswayH1 <- modernism$Possessive.wh.pronoun[21] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
theguermantesway <- (theguermanteswayH * theguermanteswayH) + (theguermanteswayH1 * theguermanteswayH1)
howitisL <- modernism$Verb.past.tense[72] - as.numeric(summary(modernism$Verb.past.tense)[2])
howitisH2 <- modernism$Verb.non.3rd.person.singular.present[72] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
howitisH1 <- modernism$Adverb[72] - as.numeric(summary(modernism$Adverb)[5])
howitis <- (howitisL * howitisL) + (howitisH2 * howitisH2) + (howitisH1 * howitisH1)
illseenillsaidL <- modernism$Verb.past.tense[83] - as.numeric(summary(modernism$Verb.past.tense)[2])
illseenillsaidH <- modernism$Adverb[83] - as.numeric(summary(modernism$Adverb)[5])
illseenillsaid <- (illseenillsaidL * illseenillsaidL) + (illseenillsaidH * illseenillsaidH)
inabuddinggroveH <- modernism$wh.determiner[20] - as.numeric(summary(modernism$wh.determiner)[5])
inabuddinggroveH1 <- modernism$Possessive.wh.pronoun[20] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
inabuddinggrove <- (inabuddinggroveH * inabuddinggroveH) + (inabuddinggroveH1 * inabuddinggroveH1)
intruderinthedustH <- modernism$Adverb[104] - as.numeric(summary(modernism$Adverb)[5])
intruderinthedust <- intruderinthedustH * intruderinthedustH
theivorytowerH1 <- modernism$Adjective.Superlative[207] - as.numeric(summary(modernism$Adjective.Superlative)[5])
theivorytowerH <- modernism$wh.determiner[207] - as.numeric(summary(modernism$wh.determiner)[5])
theivorytower <- (theivorytowerH1 * theivorytowerH1) + (theivorytowerH * theivorytowerH)
kafkaH <- modernism$Adverb[17] - as.numeric(summary(modernism$Adverb)[5])
kafka <- (kafkaH * kafkaH)
labyrinthsH <- modernism$Possessive.wh.pronoun[18] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
labyrinths <- labyrinthsH * labyrinthsH
makingofamericansH2 <- modernism$Existential[67] - as.numeric(summary(modernism$Existential)[5])
makingofamericansH1 <- modernism$Adverb[67] - as.numeric(summary(modernism$Adverb)[5])
makingofamericans <- (makingofamericansH2 * makingofamericansH2) + (makingofamericansH1 * makingofamericansH1)
malonediesL <- modernism$Verb.past.tense[87] - as.numeric(summary(modernism$Verb.past.tense)[2])
malonedies <- malonediesL * malonediesL
themansionH1 <- modernism$Adjective.Superlative[111] - as.numeric(summary(modernism$Adjective.Superlative)[5])
themansionH <- modernism$Adverb[111] - as.numeric(summary(modernism$Adverb)[5])
themansion <- (themansionH1 * themansionH1) + (themansionH * themansionH)
thenatureofacrimeL <- modernism$Verb.past.tense[125] - as.numeric(summary(modernism$Verb.past.tense)[2])
thenatureofacrimeH <- modernism$Verb.non.3rd.person.singular.present[125] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
thenatureofacrime <- (thenatureofacrimeL * thenatureofacrimeL) + (thenatureofacrimeH * thenatureofacrimeH)
theoutcryH <- modernism$Adjective.Superlative[205] - as.numeric(summary(modernism$Adjective.Superlative)[5])
theoutcry <- theoutcryH * theoutcryH
theprisonerH1 <- modernism$Possessive.wh.pronoun[23] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
theprisonerH <- modernism$wh.determiner[23] - as.numeric(summary(modernism$wh.determiner)[5])
theprisoner <- (theprisonerH * theprisonerH) + (theprisonerH1 * theprisonerH1)
requiemforanunL <- modernism$Verb.past.tense[107] - as.numeric(summary(modernism$Verb.past.tense)[2])
requiemforanun <- requiemforanunL * requiemforanunL
thesecretagentH1 <- modernism$Possessive.wh.pronoun[160] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
thesecretagent <- thesecretagentH1 * thesecretagentH1
thesenseofthepastH1 <- modernism$Adjective.Superlative[208] - as.numeric(summary(modernism$Adjective.Superlative)[5])
thesenseofthepastH <- modernism$wh.determiner[208] - as.numeric(summary(modernism$wh.determiner)[5])
thesenseofthepast <- (thesenseofthepastH1 * thesenseofthepastH1) + (thesenseofthepastH * thesenseofthepastH)
sodomandgomorrahH <- modernism$wh.determiner[22] - as.numeric(summary(modernism$wh.determiner)[5])
sodomandgomorrahH1 <- modernism$Possessive.wh.pronoun[22] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
sodomandgomorrah <- (sodomandgomorrahH1 * sodomandgomorrahH1) + (sodomandgomorrahH * sodomandgomorrahH)
thesoundandthefuryH <- modernism$Verb.non.3rd.person.singular.present[113] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
thesoundandthefury <- (thesoundandthefuryH * thesoundandthefuryH)
stirringsstillH1 <- modernism$Adverb[81] - as.numeric(summary(modernism$Adverb)[5])
stirringsstillH2 <- modernism$Preposition[81] - as.numeric(summary(modernism$Preposition)[5])
stirringsstill <- (stirringsstillH1 * stirringsstillH1) + (stirringsstillH2 * stirringsstillH2)
swannswayH <- modernism$wh.determiner[19] - as.numeric(summary(modernism$wh.determiner)[5])
swannswayH1 <- modernism$Possessive.wh.pronoun[19] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
swannsway <- (swannswayH * swannswayH) + (swannswayH1 * swannswayH1)
thesunalsorisesH <- max(modernism$Particle) - as.numeric(summary(modernism$Particle)[5])
thesunalsorises <- thesunalsorisesH * thesunalsorisesH
tenderbuttonsL1 <- modernism$Personal.Pronoun[65] - as.numeric(summary(modernism$Personal.Pronoun)[2])
tenderbuttonsL2 <- modernism$Verb.past.tense[65] - as.numeric(summary(modernism$Verb.past.tense)[2])
tenderbuttonsH1 <- modernism$wh.determiner[65] - as.numeric(summary(modernism$wh.determiner)[5])
tenderbuttonsH2 <- modernism$Existential[65] - as.numeric(summary(modernism$Existential)[5])
tenderbuttonsL3 <- modernism$Preposition[65] - as.numeric(summary(modernism$Preposition)[2])
tenderbuttons <- (tenderbuttonsL3 * tenderbuttonsL3) + (tenderbuttonsL1 * tenderbuttonsL1) + (tenderbuttonsL2 * tenderbuttonsL2) + (tenderbuttonsH1 * tenderbuttonsH1) + (tenderbuttonsH2 * tenderbuttonsH2)
textsfornothingL <- modernism$Verb.past.tense[78] - as.numeric(summary(modernism$Verb.past.tense)[2])
textsfornothingH <- modernism$Possessive.wh.pronoun[78] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
textsfornothingH2 <- modernism$Existential[78] - as.numeric(summary(modernism$Existential)[5])
textsfornothingH1 <- modernism$Verb.non.3rd.person.singular.present[78] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
textsfornothing <- (textsfornothingL * textsfornothingL) + (textsfornothingH * textsfornothingH) + (textsfornothingH2 * textsfornothingH2) + (textsfornothingH1 * textsfornothingH1)
threelivesH <- modernism$Adverb[64] - as.numeric(summary(modernism$Adverb)[5])
threelives <- threelivesH * threelivesH
timeregainedH <- modernism$wh.determiner[25] - as.numeric(summary(modernism$wh.determiner)[5])
timeregainedH1 <- modernism$Possessive.wh.pronoun[25] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
timeregained <- (timeregainedH * timeregainedH) + (timeregainedH1 * timeregainedH1)
ulyssesL <- modernism$Verb.past.tense[59] - as.numeric(summary(modernism$Verb.past.tense)[2])
ulysses <- ulyssesL * ulyssesL
theunnamableL <- modernism$Verb.past.tense[71] - as.numeric(summary(modernism$Verb.past.tense)[2])
theunnamableH2 <- modernism$Existential[71] - as.numeric(summary(modernism$Existential)[5])
theunnamableH1 <- modernism$Verb.non.3rd.person.singular.present[71] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
theunnamable <- (theunnamableL * theunnamableL) + (theunnamableH2 * theunnamableH2) + (theunnamableH1 * theunnamableH1)
thevalleyofdecisionH <- modernism$Possessive.wh.pronoun[217] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
thevalleyofdecision <- thevalleyofdecisionH * thevalleyofdecisionH
wattH <- modernism$Possessive.wh.pronoun[69] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
watt <- wattH * wattH
thewavesL <- modernism$Verb.past.tense[92] - as.numeric(summary(modernism$Verb.past.tense)[2])
thewavesH <- max(modernism$Verb.non.3rd.person.singular.present) - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
thewaves <- (thewavesL * thewavesL) + (thewavesH * thewavesH)
worstwardhoL <- modernism$Personal.Pronoun[82] - as.numeric(summary(modernism$Personal.Pronoun)[2])
worstwardhoL1 <- modernism$Verb.past.tense[82] - as.numeric(summary(modernism$Verb.past.tense)[2])
worstwardhoH1 <- max(modernism$Adjective.Superlative) - as.numeric(summary(modernism$Adjective.Superlative)[5])
worstwardhoH2 <- modernism$Possessive.wh.pronoun[82] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
worstwardhoH <- modernism$Adverb[82] - as.numeric(summary(modernism$Adverb)[5])
worstwardho <- (worstwardhoL * worstwardhoL) + (worstwardhoL1 * worstwardhoL1) + (worstwardhoH1 * worstwardhoH1) + (worstwardhoH2 * worstwardhoH2) + (worstwardhoH * worstwardhoH)
theyearsH <- modernism$Verb.past.tense[90] - as.numeric(summary(modernism$Verb.past.tense)[5])
theyears <- theyearsH * theyearsH

modernismresiduals <- c(absalom, acrosstheriverandintothetrees, asilaydying, theautobiographyofalicebtoklas, beckettstories, company, everybodysautobiography, afable, finneganswake, firstlove, fizzles, forwhomthebelltolls, thefugitive, theguermantesway, howitis, illseenillsaid, inabuddinggrove, intruderinthedust, theivorytower, kafka, labyrinths, makingofamericans, malonedies, themansion, thenatureofacrime, theoutcry, theprisoner, requiemforanun, thesecretagent, thesenseofthepast, sodomandgomorrah, thesoundandthefury, stirringsstill, swannsway, thesunalsorises, tenderbuttons, textsfornothing, threelives, timeregained, ulysses, theunnamable, thevalleyofdecision, watt, thewaves, worstwardho, theyears)

modernismtitles <- c("Absalom, Absalom", "Across The River and into the Trees", "As I Lay Dying", "The Autobiography of Alice B Toklas", "Beckett Stories", "Company", "Everybody's Autobiography", "A Fable", "Finnegans Wake", "First Love", "Fizzles", "For Whom the Bell Tolls", "The Fugitive", "The Guermantes Way", "How It Is", "Ill Seen Ill Said", "In A Budding Grove", "Intruder in the Dust", "The Ivory Tower", "Kafka", "Labyrinths", "Making of Americans", "Malone Dies", "The Mansion", "The Nature of a Crime", "The Outcry", "The Prisoner", "Requiem for a Nun", "The Secret Agent", "The Sense of the Past", "Sodom and Gomorrah", "The Sound and the Fury", "Stirrings Still", "Swann's Way", "The Sun Also Rises", "Tender Buttons", "Texts for Nothing", "Three Lives", "Time Regained", "Ulysses", "The Unnamable", "The Valley of Decision", "Watt", "The Waves", "Worstward Ho", "The Years")

modernismresiduals <- cbind.data.frame(modernismtitles, modernismresiduals)

faulkner <- sum(absalom, asilaydying, afable, intruderinthedust, themansion, requiemforanun, thesoundandthefury)
hemingway <- sum(acrosstheriverandintothetrees, forwhomthebelltolls, thesunalsorises)
stein <- sum(theautobiographyofalicebtoklas, everybodysautobiography, makingofamericans, tenderbuttons, threelives)
beckett <- sum(beckettstories, company, firstlove, fizzles, howitis, illseenillsaid, malonedies, stirringsstill, textsfornothing, theunnamable, watt, worstwardho)
joyce <- sum(finneganswake, ulysses)
proust <- sum(thefugitive, theguermantesway, inabuddinggrove, theprisoner, sodomandgomorrah, swannsway, timeregained)
james <- sum(theivorytower, theoutcry, thesenseofthepast)
conrad <- sum(thenatureofacrime, thesecretagent)
wharton <- thevalleyofdecision
woolf <- sum(thewaves, theyears)

modernistwriters <- c((faulkner / 7), (hemingway / 3), (stein / 5), (beckett / 12), (joyce / 2), (proust / 7), (james / 3), (conrad / 2), wharton, (woolf / 2), kafka, labyrinths)

modernistnames <- c("Faulkner", "Hemingway", "Stein", "Beckett", "Joyce", "Proust", "James", "Conrad", "Wharton", "Woolf", "Kafka", "Borges")

modernistwriters <- cbind.data.frame(modernistnames, modernistwriters)

plot(modernistwriters, main = "Mapping Outliers", xlab = "Names", ylab = "Grammatical Divergence")

realismtitles <- c("Balzac", "Eliot", "Zola", "Dostoevsky", "Crane", "Thackeray", "Austen")

balzac <- sum(anepisodeundertheterror, theballatsceaux, lettersoftwobrides, seraphita) / 4
eliot <- sum(theattackonthemill)
zola <- sum(thedeadwomanswish, hismasterpiece) / 2
dostoevsky <- sum(theinsultedandhumiliated, poorfolk, unclesdream, villageofstepanchikovo, whitenights) / 5
crane <- sum(thelittleregiment, theredbadgeofcourage, thethirdviolet) / 3
thackeray <- sum(lovelthewidower)
austen <- sum(mansfieldpark)

realismnovelists <- c(balzac, eliot, zola, dostoevsky, crane, thackeray, austen)

# Use a new name here so that the realism data frame used above is not overwritten
realistwriters <- cbind.data.frame(realismtitles, realismnovelists)

plot(realistwriters, main = "Grammatical Divergence amongst Realist Novelists", xlab = "Names", ylab = "Divergence (%)")
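The per-text scores above all follow one pattern: take a text's value on a feature, subtract the third quartile (for a high outlier) or the first quartile (for a low one), square the difference, and sum the squares across the features on which the text is an outlier. A minimal sketch of that pattern as a reusable function follows. The name `quartile_distance_sq` and the toy column are hypothetical, and `quantile` is used here in place of `summary(...)[2]` and `summary(...)[5]`.

```r
# Hypothetical helper: squared distance of one text's value from the relevant
# quartile of its corpus (Q3 for high outliers, Q1 for low outliers).
quartile_distance_sq <- function(values, index, side = c("high", "low")) {
  side <- match.arg(side)
  p    <- if (side == "high") 0.75 else 0.25
  ref  <- quantile(values, p, names = FALSE)  # default type-7 quantile, as in summary()
  (values[index] - ref)^2
}

# Toy column (assumption) standing in for something like modernism$Existential
toy <- c(0.10, 0.15, 0.20, 0.25, 0.30, 0.90)
score_high <- quartile_distance_sq(toy, 6, "high")
score_low  <- quartile_distance_sq(toy, 1, "low")

# A text that is an outlier on several features gets the sum of its scores
total <- score_high + score_low
```

Passing the row index and the side explicitly keeps each text's score tied to its own row and its own feature, which is where copy-and-paste slips are most likely to creep in when the calculation is written out by hand.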

centuries %>% group_by(Year) %>%
  summarise(sentence_summary = mean(Slength)) -> sentence_summary

sentence_summary %>% dygraph(main = "Sentence Lengths 1794 - 1988", ylab = "Sentence Length (Words)", xlab = "Year") %>% dyRangeSelector()
correlationdata <- cbind(centuries[,32], centuries[,15], centuries[,23], centuries[,17], centuries[,6], centuries[,28], centuries[,20], centuries[,26], centuries[,5], centuries[,30], centuries[,9])

colnames(correlationdata) <- c("Sentence Length", "Personal Pronouns", "Verb Past Tense", "Adverb", "Prepositions", "Wh Determiners", "Particles", "Verb 3rd Person Singular Pronouns", "Existentials", "Possessive wh Pronouns", "Adjective Suplerlatives")

res <- cor(correlationdata)
round(res, 2)
##                                   Sentence Length Personal Pronouns
## Sentence Length                              1.00             -0.29
## Personal Pronouns                           -0.29              1.00
## Verb Past Tense                             -0.30              0.54
## Adverb                                      -0.06              0.38
## Prepositions                                 0.49             -0.39
## Wh Determiners                               0.66             -0.40
## Particles                                   -0.44              0.31
## Verb 3rd Person Singular Pronouns           -0.13              0.05
## Existentials                                -0.13              0.15
## Possessive wh Pronouns                       0.44             -0.50
## Adjective Suplerlatives                      0.36             -0.26
##                                   Verb Past Tense Adverb Prepositions
## Sentence Length                             -0.30  -0.06         0.49
## Personal Pronouns                            0.54   0.38        -0.39
## Verb Past Tense                              1.00   0.02        -0.24
## Adverb                                       0.02   1.00        -0.10
## Prepositions                                -0.24  -0.10         1.00
## Wh Determiners                              -0.38  -0.26         0.58
## Particles                                    0.52   0.12        -0.33
## Verb 3rd Person Singular Pronouns           -0.56   0.06        -0.36
## Existentials                                 0.05   0.29        -0.23
## Possessive wh Pronouns                      -0.35  -0.31         0.41
## Adjective Suplerlatives                     -0.43   0.16         0.24
##                                   Wh Determiners Particles
## Sentence Length                             0.66     -0.44
## Personal Pronouns                          -0.40      0.31
## Verb Past Tense                            -0.38      0.52
## Adverb                                     -0.26      0.12
## Prepositions                                0.58     -0.33
## Wh Determiners                              1.00     -0.62
## Particles                                  -0.62      1.00
## Verb 3rd Person Singular Pronouns          -0.09     -0.19
## Existentials                               -0.15      0.15
## Possessive wh Pronouns                      0.70     -0.51
## Adjective Suplerlatives                     0.25     -0.35
##                                   Verb 3rd Person Singular Pronouns
## Sentence Length                                               -0.13
## Personal Pronouns                                              0.05
## Verb Past Tense                                               -0.56
## Adverb                                                         0.06
## Prepositions                                                  -0.36
## Wh Determiners                                                -0.09
## Particles                                                     -0.19
## Verb 3rd Person Singular Pronouns                              1.00
## Existentials                                                   0.01
## Possessive wh Pronouns                                        -0.07
## Adjective Suplerlatives                                        0.09
##                                   Existentials Possessive wh Pronouns
## Sentence Length                          -0.13                   0.44
## Personal Pronouns                         0.15                  -0.50
## Verb Past Tense                           0.05                  -0.35
## Adverb                                    0.29                  -0.31
## Prepositions                             -0.23                   0.41
## Wh Determiners                           -0.15                   0.70
## Particles                                 0.15                  -0.51
## Verb 3rd Person Singular Pronouns         0.01                  -0.07
## Existentials                              1.00                  -0.31
## Possessive wh Pronouns                   -0.31                   1.00
## Adjective Suplerlatives                  -0.10                   0.22
##                                   Adjective Suplerlatives
## Sentence Length                                      0.36
## Personal Pronouns                                   -0.26
## Verb Past Tense                                     -0.43
## Adverb                                               0.16
## Prepositions                                         0.24
## Wh Determiners                                       0.25
## Particles                                           -0.35
## Verb 3rd Person Singular Pronouns                    0.09
## Existentials                                        -0.10
## Possessive wh Pronouns                               0.22
## Adjective Suplerlatives                              1.00
#Wh determiners and sentence length are positively correlated (r = 0.66), as are possessive wh pronouns and wh determiners (r = 0.70)