This is a methodological account of a preliminary data exploration and analysis carried out on a corpus of 500 novels. 250 of these texts are generally categorised as belonging to the genre of ‘realism’ and are used here as a benchmark against which modernist literary style may be defined. Chronologically, the first text in the realist corpus is Jane Austen’s ‘Lady Susan’, written in 1794, and the last is Thomas Hardy’s ‘Jude the Obscure’, published in 1895. This corpus contains the complete prose works (novels, novellas and short story collections) of fifteen writers: Jane Austen, Emily, Anne and Charlotte Bronte, Stephen Crane, Honoré de Balzac, Charles Dickens, Fyodor Dostoevsky, George Eliot, Gustave Flaubert, Elizabeth Gaskell, Thomas Hardy, William Makepeace Thackeray, Leo Tolstoy and Émile Zola.
The corpus of 250 modernist novels begins in 1869, with Henry James’ first bloc of short stories, and continues to Samuel Beckett’s 1988 novella ‘Stirrings Still’, so the date ranges of the two corpora overlap. The modernist corpus consists of the complete works of nineteen writers: Djuna Barnes, Samuel Beckett, Jorge Luis Borges, Elizabeth Bowen, Joseph Conrad, William Faulkner, F. Scott Fitzgerald, Ford Madox Ford, Ernest Hemingway, Henry James, James Joyce, Franz Kafka, D.H. Lawrence, Katherine Mansfield, Flann O’Brien, Marcel Proust, Gertrude Stein, Edith Wharton and Virginia Woolf.
This disproportion between the two corpora, with fifteen realists versus nineteen modernists, may seem disconcerting at first, but what the statistical analyses require is an equal number of observations, not an equal number of novelists. Realist authors wrote more novels than modernist authors, which made it impossible to retain the same number of authors on each side of the generic divide. The decision was therefore reached to expand the number of modernist texts where possible. Rather than treating each short story as a text in isolation, or treating all of an author’s stories as a single monolithic corpus, such as ‘The Complete Short Stories of Ford Madox Ford’, an attempt was made to preserve the form in which the texts were first published. Fitzgerald, to give one example, published a number of short story collections, such as ‘Tales of the Jazz Age’ (1922) and the ‘Pat Hobby Stories’ (1941), but he also published many other stories in magazines throughout his career. These were grouped into blocs of 1909 to 1917, 1920 to 1925, 1926 to 1934, 1935 to 1940 and stories published posthumously from 1940 onwards. This was likewise the case for James; his short stories were periodised into blocs of stories published from 1864 to 1869, 1870 to 1879, 1880 to 1889, 1890 to 1899, 1900 to 1909, and finally stories he published from 1910 onwards. One exception was Joseph Conrad’s novel ‘Heart of Darkness’, which was first published in three parts in Blackwood’s Magazine in 1899, and then in 1902 as part of a book entitled ‘Youth: A Narrative and Two Other Stories’. This book also included the short autobiographical story which gives the collection its name, and Conrad’s novel ‘The End of the Tether’.
It was decided that, since ‘Heart of Darkness’ is usually studied as a novel in its own right and as a significant modernist milestone, quantifying its grammar apart from these other texts was justifiable, particularly given the target of 250 separate modernist texts. This approach also allows us to do greater analytical justice to how a writer’s prose style may change over the course of their writing career, particularly in the context of their short prose fiction.
It was also necessary to reduce the number of realist texts. This was primarily due to Balzac, who authored over one hundred prose works. Fortunately, Balzac’s writings were published in thematic cycles, such as ‘Scenes from Parisian Life’, ‘Scenes from Political Life’ and ‘Catherine di Medici’. The length and category of the texts in each of these groups vary widely: some are novels, some are novellas and many are short stories. The decision was made to collapse the shorter texts according to these broader groupings, with the result that Balzac’s short stories were the category most affected by this reduction in the number of realist texts. This was done in lieu of removing a large number of Balzac’s works from the corpus altogether, which would have required making assumptions about which of his writings are more representative or canonical and therefore more worthy of inclusion.
One final aspect to consider is the international dimension. The realist corpus includes ten novelists who wrote in English, but there are also two Russian and three French realists, two of whom, Zola and the aforementioned Balzac, were far more prolific than any other writer in either corpus. Zola and Balzac composed 86 and 34 novels, short story collections or novellas respectively. The consequence is that well over half of the realist corpus is in translation, compared with just under 10% of the modernist corpus. I intend to address this at a later stage in my research. Some work has been published in digital humanities journals on the issues surrounding the quantification of literature in translation and across languages, but I do not yet possess a sufficient breadth of knowledge in this field to comment intelligently on the matter. I do think it is important to include French and Russian writers in the realist corpus, on the basis that many of them, be they Tolstoy, Flaubert or Balzac, exerted a significant influence on their modernist successors, including James, Woolf and Joyce, amongst others.
With regard to the issue of translation, as with so much else in quantitative analysis, we must turn to the question of what data is available for us to capture. It was very straightforward to find the complete works of these novelists in a relatively clean format; the problem is that I could only find one translation of each which was out of copyright. Whether or not these are ‘the best’ or most accurate translations is somewhat beside the point. From the reading I have done around literary translation, it is clear that translations change over time; this is in the nature of how text is received and reconstituted in different eras for different communities of readers. The most germane point here is that the translations being analysed could not be considered the most contemporary. There might be an argument for retaining these older translations on the basis that they are more likely to be the versions which were circulating in the early twentieth century, and therefore the translations modernist authors would have been more likely to have read, but making this claim would require more evidence, such as what languages each author read novels in and what their reading habits were more generally. The point is that there are translations in this corpus which would not be regarded as the standard English versions today, such as C.K. Scott Moncrieff’s ‘Remembrance of Things Past’, his translation of Marcel Proust’s ‘À la recherche du temps perdu’, published between 1922 and 1931, which has since been superseded, first by D.J. Enright’s version in 1992, and then by that of Lydia Davis and others in 2002.
So, to turn to the analysis itself. My research is directed towards the quantitative analysis of grammar. The rationale is that, by examining the varying quantities of particular categories of words, such as verbs, adjectives or prepositions, we could develop an understanding of how literary fiction changes from the beginning of the nineteenth century until the end of the twentieth and, more specifically, how literary modernism departs from, or perhaps even remains continuous with, the previous generation of novel writing. Determining the stylistic relationship between these two genres on the basis of grammar is what I am interested in doing.
Though it might seem as though we command a fairly broad overview of two centuries of literary history, considering we are analysing 500 novels from 1794 to 1988, it should be noted that this is by no means a prospect for a definitive answer on the matter of modernist literary style. There are only five hundred observations for each variable, which works out at just under 2.6 books per year, a tiny percentage of what was actually published in this period.
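The books-per-year figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes the span runs from 1794, the date given above for ‘Lady Susan’, to 1988, the date of ‘Stirrings Still’, and counts both end years.

```python
# Assumed span: 1794 ('Lady Susan') to 1988 ('Stirrings Still'), both years counted.
first_year = 1794
last_year = 1988
n_texts = 500

span_years = last_year - first_year + 1
texts_per_year = n_texts / span_years

print(f"{span_years} years, {texts_per_year:.2f} texts per year")
```

Counting the span without the final year changes the result only in the second decimal place, so the figure stays just under 2.6 either way.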
These texts were obtained in .mobi and .azw formats and converted into .txt files using the ebook management program Calibre. Front and back matter, such as introductions, forewords and afterwords, were removed, followed by titles, chapter names and section divisions. The corpora were then split into separate novels and read into the Python workspace. The data was inspected thoroughly at each stage in the process to ensure that it was not becoming corrupted or being made dirtier rather than cleaner.
Now, the programming language Python has a natural language processing library called the Natural Language Toolkit, or NLTK, and this is the library that we obtain our part-of-speech, or POS, tagger from. Once each .txt file was read into the workspace and tokenized, the POS tagger looped through each one and assigned each word to one of thirty-five different grammatical categories. These were then pared down to twenty-nine, because some of them, such as ‘foreign words’, list objects and exclamations, were too specific to occur on any scale which could be considered significant. As such, we quantified coordinating conjunctions, determiners, existentials, prepositions, adjectives, modals, nouns, predeterminers, pronouns, adverbs, particles, the word ‘to’ and verbs. It should be noted that adjectives and adverbs are split into three separate categories of normal, comparative and superlative, and that nouns can be singular or plural. Verbs can also be base forms, past tense, gerunds, past participles, and third and non-third person singular presents. Finally, we counted the number of words in each text and divided this by the number of full stops in order to reach an average sentence length. This was the first variable to be quantified. Before we look at it, though, we should note that quantifying sentence length presents a unique difficulty in the case of Beckett’s 1964 novel ‘How It Is’, published initially as ‘Comment c’est’ in 1961.
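The counting step can be sketched as follows. This is a minimal illustration rather than the script actually used: it assumes the tagger's output is a list of (word, tag) pairs, which is the shape `nltk.pos_tag` returns, and the function names, the toy input and the choice to include punctuation tokens in the percentage denominator are all assumptions made for the example. Sentence length is taken here as words per full stop, which is the ratio that yields a length in words.

```python
from collections import Counter


def tag_percentages(tagged_tokens):
    """Return each POS tag's share of all tagged tokens, as a percentage."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: 100.0 * n / total for tag, n in counts.items()}


def average_sentence_length(tagged_tokens):
    """Words per full stop; None when the text contains no full stops."""
    full_stops = sum(1 for word, _ in tagged_tokens if word == ".")
    # A token counts as a word if it has at least one letter or digit.
    words = sum(1 for word, _ in tagged_tokens
                if any(ch.isalnum() for ch in word))
    if full_stops == 0:
        return None
    return words / full_stops


# Hand-tagged toy input standing in for the tagger's output on a real novel.
sample = [("It", "PRP"), ("was", "VBD"), ("late", "JJ"), (".", "."),
          ("She", "PRP"), ("slept", "VBD"), (".", ".")]

percentages = tag_percentages(sample)
mean_length = average_sentence_length(sample)
```

Returning `None` rather than a number for a text without full stops keeps an undefined value from being mistaken for a real measurement further down the pipeline.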
‘How It Is’ contains no punctuation, neither full stops nor commas, so the question of how we might quantify its average sentence length is something of a conundrum. The first impulse might be to count all 36,347 of its words as one big sentence, but this would of course make it a major outlier, render the boxplot useless and seriously skew the data. Plotting it as zero would be less detrimental to the analysis but also inaccurate: how could we say that the average sentence length is zero words? So we exclude ‘How It Is’ as a major outlier, and also Anne Bronte’s novel ‘Agnes Grey’, as that is the highest realist outlier for sentence length, which leaves 249 observations in each group. This is, again, not ideal, but the analysis won’t work otherwise.
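The three options can be compared on a toy example. The sketch below is illustrative only: the three sentence lengths and the text labels are made-up placeholder values, not figures from the corpus, and only the 36,347-word count comes from the discussion above.

```python
from statistics import mean

# Placeholder sentence lengths in words; None marks a text with no full stops.
lengths = {
    "Text A": 21.4,
    "Text B": 18.9,
    "Text C": 24.0,
    "Unpunctuated text": None,
}

# Option 1: exclude the undefined value altogether.
defined = [v for v in lengths.values() if v is not None]
excluded = sorted(t for t, v in lengths.items() if v is None)

# Option 2: treat the whole text as one 36,347-word sentence.
as_one_sentence = defined + [36347]

# Option 3: record its sentence length as zero.
as_zero = defined + [0]

for label, values in [("excluded", defined),
                      ("one sentence", as_one_sentence),
                      ("zero", as_zero)]:
    print(f"{label}: mean={mean(values):.1f}, max={max(values)}")
```

On these placeholder values the single-sentence option stretches the range by three orders of magnitude, while the zero option drags the mean down with a value that measures nothing, which is the reasoning behind excluding the text.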
realism <- read.csv("realism.csv", stringsAsFactors = FALSE)
modernism <- read.csv("modernism.csv", stringsAsFactors = FALSE)
centuries <- read.csv("centuries.csv", stringsAsFactors = FALSE)
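# Row 322 of the combined table corresponds to row 72 of the modernist table ('How It Is').
# Its sentence length entry is overwritten with 0 before the column is converted to numeric.
centuries$Slength[322] <- 0
centuries$Slength <- as.numeric(centuries$Slength)
attach(centuries)
modernism$Slength[72] <- 0
# Drop row 72 ('How It Is') from the modernist sentence lengths.
modernismsentencelengths <- modernism$Slength[-72]
modernismsentencelengths <- as.numeric(modernismsentencelengths)
# Drop row 19 from the realist sentence lengths (the realist outlier excluded above).
realismsentencelengths <- realism$Slength[-19]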
hist(realismsentencelengths, main = "Sentence Length in Realist Novels", xlab = "Sentence Length (Words)")
hist(modernismsentencelengths, main = "Sentence Length in Modernist Novels", xlab = "Sentence Length (Words)")
var(realismsentencelengths)
## [1] 14.29844
var(modernismsentencelengths)
## [1] 37.06567
var.test(realismsentencelengths, modernismsentencelengths)
##
## F test to compare two variances
##
## data: realismsentencelengths and modernismsentencelengths
## F = 0.38576, num df = 248, denom df = 248, p-value = 1.866e-13
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3005835 0.4950720
## sample estimates:
## ratio of variances
## 0.3857596
wilcox.test(realismsentencelengths, modernismsentencelengths)
##
## Wilcoxon rank sum test with continuity correction
##
## data: realismsentencelengths and modernismsentencelengths
## W = 45263, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realismsentencelengths, modernismsentencelengths, notch = TRUE, main = "Sentence Length in Realist versus Modernist Novels", names = c("Realism", "Modernism"), ylab = "Sentence Length (Words)", xlab = "Genre")
median(realismsentencelengths)
## [1] 22.30144
median(modernismsentencelengths)
## [1] 18.44173
median(realismsentencelengths) - median(modernismsentencelengths)
## [1] 3.859712
hist(realism$Personal.Pronoun, main = "Personal Pronouns in Realist Novels", xlab = "Personal Pronouns in Realist Novels (%)")
hist(modernism$Personal.Pronoun, main = "Personal Pronouns in Modernist Novels", xlab = "Personal Pronouns in Modernist Novels (%)")
var(realism$Personal.Pronoun)
## [1] 0.7873269
var(modernism$Personal.Pronoun)
## [1] 1.261971
var.test(realism$Personal.Pronoun, modernism$Personal.Pronoun)
##
## F test to compare two variances
##
## data: realism$Personal.Pronoun and modernism$Personal.Pronoun
## F = 0.62389, num df = 249, denom df = 249, p-value = 0.0002143
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4863769 0.8002738
## sample estimates:
## ratio of variances
## 0.6238868
wilcox.test(realism$Personal.Pronoun, modernism$Personal.Pronoun)
##
## Wilcoxon rank sum test with continuity correction
##
## data: realism$Personal.Pronoun and modernism$Personal.Pronoun
## W = 15117, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Personal.Pronoun, modernism$Personal.Pronoun, notch = TRUE, main = "Personal Pronouns in Realist versus Modernist Novels", ylab = "Personal Pronouns (%)", xlab = "Genre", names = c("Realism", "Modernism"))
median(realism$Personal.Pronoun)
## [1] 5.364828
median(modernism$Personal.Pronoun)
## [1] 6.436688
median(realism$Personal.Pronoun) - median(modernism$Personal.Pronoun)
## [1] -1.071859
hist(realism$Preposition, main = "Distribution of Prepositions in Realist Novels", xlab = "Prepositions (%)")
hist(modernism$Preposition, main = "Distribution of Prepositions in Modernist Novels", xlab = "Prepositions (%)")
var(realism$Preposition)
## [1] 0.5668526
var(modernism$Preposition)
## [1] 1.466232
var.test(realism$Preposition, modernism$Preposition)
##
## F test to compare two variances
##
## data: realism$Preposition and modernism$Preposition
## F = 0.38661, num df = 249, denom df = 249, p-value = 1.887e-13
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3013941 0.4959071
## sample estimates:
## ratio of variances
## 0.386605
median(realism$Preposition)
## [1] 10.89971
median(modernism$Preposition)
## [1] 10.46074
median(realism$Preposition) - median(modernism$Preposition)
## [1] 0.4389673
wilcox.test(realism$Preposition, modernism$Preposition)
##
## Wilcoxon rank sum test with continuity correction
##
## data: realism$Preposition and modernism$Preposition
## W = 37345, p-value = 0.0001614
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Preposition, modernism$Preposition, notch = TRUE, main = "Percentage of Prepositions in Realist versus Modernist Novels", xlab = "Genre", ylab = "Prepositions (%)", names = c("Realism", "Modernism"))
hist(realism$Verb.non.3rd.person.singular.present, main = "Non third-person singular present verbs in realist novels", xlab = "Non third-person singular present verbs (%)")
hist(modernism$Verb.non.3rd.person.singular.present, main = "Non third-person singular present verbs in modernist novels", xlab = "Non third-person singular present verbs (%)")
var(realism$Verb.non.3rd.person.singular.present)
## [1] 0.226187
var(modernism$Verb.non.3rd.person.singular.present)
## [1] 0.2548945
var.test(realism$Verb.non.3rd.person.singular.present, modernism$Verb.non.3rd.person.singular.present)
##
## F test to compare two variances
##
## data: realism$Verb.non.3rd.person.singular.present and modernism$Verb.non.3rd.person.singular.present
## F = 0.88737, num df = 249, denom df = 249, p-value = 0.3464
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6917901 1.1382561
## sample estimates:
## ratio of variances
## 0.8873749
wilcox.test(realism$Verb.non.3rd.person.singular.present, modernism$Verb.non.3rd.person.singular.present)
##
## Wilcoxon rank sum test with continuity correction
##
## data: realism$Verb.non.3rd.person.singular.present and modernism$Verb.non.3rd.person.singular.present
## W = 34747, p-value = 0.03042
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Verb.non.3rd.person.singular.present, modernism$Verb.non.3rd.person.singular.present, notch = TRUE, main = "Non Third Person Singular Present Verbs in Realism Versus Modernism", xlab = "Genre", ylab = "Non Third Person Singular Present Verbs (%)", names = c("Realism", "Modernism"))
median(realism$Verb.non.3rd.person.singular.present)
## [1] 1.581927
median(modernism$Verb.non.3rd.person.singular.present)
## [1] 1.458354
median(realism$Verb.non.3rd.person.singular.present) - median(modernism$Verb.non.3rd.person.singular.present)
## [1] 0.1235726
hist(realism$Existential, main = "Existentials in Realist Novels", xlab = "Existentials (%)")
hist(modernism$Existential, main = "Existentials in Modernist Novels", xlab = "Existentials (%)")
var(realism$Existential)
## [1] 0.003356196
var(modernism$Existential)
## [1] 0.01716611
var.test(realism$Existential, modernism$Existential)
##
## F test to compare two variances
##
## data: realism$Existential and modernism$Existential
## F = 0.19551, num df = 249, denom df = 249, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1524202 0.2507888
## sample estimates:
## ratio of variances
## 0.1955129
wilcox.test(realism$Existential, modernism$Existential)
##
## Wilcoxon rank sum test with continuity correction
##
## data: realism$Existential and modernism$Existential
## W = 17183, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Existential, modernism$Existential, notch = TRUE, xlab = "Genre", ylab = "Existentials (%)", names = c("Realism", "Modernism"))
median(realism$Existential)
## [1] 0.1655106
median(modernism$Existential)
## [1] 0.2081313
median(realism$Existential) - median(modernism$Existential)
## [1] -0.0426207
hist(realism$Adjective.Superlative, main = "Superlative Adjectives in Realist Novels", xlab = "Superlative Adjectives (%)")
hist(modernism$Adjective.Superlative, main = "Superlative Adjectives in Modernist Novels", xlab = "Superlative Adjectives (%)")
var(realism$Adjective.Superlative)
## [1] 0.002045467
var(modernism$Adjective.Superlative)
## [1] 0.003396895
var.test(realism$Adjective.Superlative, modernism$Adjective.Superlative)
##
## F test to compare two variances
##
## data: realism$Adjective.Superlative and modernism$Adjective.Superlative
## F = 0.60216, num df = 249, denom df = 249, p-value = 6.982e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4694373 0.7724017
## sample estimates:
## ratio of variances
## 0.6021579
wilcox.test(realism$Adjective.Superlative, modernism$Adjective.Superlative)
##
## Wilcoxon rank sum test with continuity correction
##
## data: realism$Adjective.Superlative and modernism$Adjective.Superlative
## W = 35867, p-value = 0.004265
## alternative hypothesis: true location shift is not equal to 0
boxplot(realism$Adjective.Superlative, modernism$Adjective.Superlative, notch = TRUE, main = "Superlative Adjectives in Realist Versus Modernist Novels", xlab = "Genre", ylab = "Adjective Superlatives (%)", names = c("Realism", "Modernism"))
median(realism$Adjective.Superlative)
## [1] 0.1359699
median(modernism$Adjective.Superlative)
## [1] 0.1282673
median(realism$Adjective.Superlative) - median(modernism$Adjective.Superlative)
## [1] 0.007702564
anepisodeundertheterrorH <- realism$Adjective.Superlative[221] - as.numeric(summary(realism$Adjective.Superlative)[5])
anepisodeundertheterror <- anepisodeundertheterrorH * anepisodeundertheterrorH
theattackonthemillH <- max(realism$Particle) - as.numeric(summary(realism$Particle)[5])
theattackonthemill <- theattackonthemillH * theattackonthemillH
theballatsceauxH <- realism$Adjective.Superlative[166] - as.numeric(summary(realism$Adjective.Superlative)[5])
theballatsceaux <- theballatsceauxH * theballatsceauxH
thedeadwomanswishH <- realism$Personal.Pronoun[21] - as.numeric(summary(realism$Personal.Pronoun)[5])
thedeadwomanswish <- thedeadwomanswishH * thedeadwomanswishH
hismasterpieceH <- realism$Existential[36] - as.numeric(summary(realism$Existential)[5])
hismasterpiece <- hismasterpieceH * hismasterpieceH
theinsultedandhumiliatedH <- realism$Personal.Pronoun[112] - as.numeric(summary(realism$Personal.Pronoun)[5])
theinsultedandhumiliated <- theinsultedandhumiliatedH * theinsultedandhumiliatedH
thelittleregimentH <- max(realism$Existential) - as.numeric(summary(realism$Existential)[5])
thelittleregiment <- thelittleregimentH * thelittleregimentH
lettersoftwobridesL <- realism$Verb.past.tense[167] - as.numeric(summary(realism$Verb.past.tense)[2])
lettersoftwobrides <- lettersoftwobridesL * lettersoftwobridesL
lovelthewidowerL <- realism$Preposition[135] - as.numeric(summary(realism$Preposition)[2])
lovelthewidower <- lovelthewidowerL * lovelthewidowerL
mansfieldparkH <- realism$Adjective.Superlative[61] - as.numeric(summary(realism$Adjective.Superlative)[5])
mansfieldpark <- mansfieldparkH * mansfieldparkH
poorfolkH <- realism$Verb.non.3rd.person.singular.present[107] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
poorfolk <- poorfolkH * poorfolkH
seraphitaH <- realism$wh.determiner[248] - as.numeric(summary(realism$wh.determiner)[5])
seraphita <- seraphitaH * seraphitaH
theredbadgeofcourageH <- realism$Existential[3] - as.numeric(summary(realism$Existential)[5])
theredbadgeofcourage <- theredbadgeofcourageH * theredbadgeofcourageH
thethirdvioletL <- realism$Preposition[5] - as.numeric(summary(realism$Preposition)[2])
thethirdviolet <- thethirdvioletL * thethirdvioletL
unclesdreamH <- realism$Verb.non.3rd.person.singular.present[100] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
unclesdream <- unclesdreamH * unclesdreamH
villageofstepanchikovoH <- realism$Verb.non.3rd.person.singular.present[111] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
villageofstepanchikovo <- villageofstepanchikovoH * villageofstepanchikovoH
whitenightsH1 <- realism$Personal.Pronoun[124] - as.numeric(summary(realism$Personal.Pronoun)[5])
whitenightsH2 <- realism$Verb.non.3rd.person.singular.present[124] - as.numeric(summary(realism$Verb.non.3rd.person.singular.present)[5])
whitenights <- (whitenightsH1 * whitenightsH1) + (whitenightsH2 * whitenightsH2)
absalomH1 <- modernism$Adverb[99] - as.numeric(summary(modernism$Adverb)[5])
absalomH2 <- modernism$wh.determiner[99] - as.numeric(summary(modernism$wh.determiner)[5])
absalomH3 <- modernism$Possessive.wh.pronoun[99] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
absalom <- (absalomH2 * absalomH2) + (absalomH3 * absalomH3) + (absalomH1 * absalomH1)
acrosstheriverandintothetreesH <- modernism$Verb.non.3rd.person.singular.present[1] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
acrosstheriverandintothetrees <- acrosstheriverandintothetreesH * acrosstheriverandintothetreesH
asilaydyingL <- modernism$Verb.past.tense[100] - as.numeric(summary(modernism$Verb.past.tense)[2])
asilaydying <- (asilaydyingL * asilaydyingL)
theautobiographyofalicebtoklasH2 <- modernism$Existential[66] - as.numeric(summary(modernism$Existential)[5])
theautobiographyofalicebtoklasH1 <- modernism$Adverb[66] - as.numeric(summary(modernism$Adverb)[5])
theautobiographyofalicebtoklas <- (theautobiographyofalicebtoklasH2 * theautobiographyofalicebtoklasH2) + (theautobiographyofalicebtoklasH1 * theautobiographyofalicebtoklasH1)
beckettH1 <- modernism$Adjective.Superlative[80] - as.numeric(summary(modernism$Adjective.Superlative)[5])
beckettL1 <- modernism$Personal.Pronoun[80] - as.numeric(summary(modernism$Personal.Pronoun)[2])
beckettL2 <- modernism$Verb.past.tense[80] - as.numeric(summary(modernism$Verb.past.tense)[2])
beckettH <- modernism$Adverb[80] - as.numeric(summary(modernism$Adverb)[5])
beckettstories <- (beckettH1 * beckettH1) + (beckettL1 * beckettL1) + (beckettL2 * beckettL2) + (beckettH * beckettH)
companyL <- modernism$Verb.past.tense[84] - as.numeric(summary(modernism$Verb.past.tense)[2])
companyH <- modernism$Adverb[84] - as.numeric(summary(modernism$Adverb)[5])
company <- (companyL * companyL) + (companyH * companyH)
everybodysautobiographyH2 <- modernism$Existential[63] - as.numeric(summary(modernism$Existential)[5])
everybodysautobiographyH1 <- modernism$Adverb[63] - as.numeric(summary(modernism$Adverb)[5])
everybodysautobiography <- (everybodysautobiographyH2 * everybodysautobiographyH2) + (everybodysautobiographyH1 * everybodysautobiographyH1)
afableH <- modernism$Adverb[101] - as.numeric(summary(modernism$Adverb)[5])
afable <- (afableH * afableH)
finneganswakeL <- modernism$Verb.past.tense[62] - as.numeric(summary(modernism$Verb.past.tense)[2])
finneganswake <- finneganswakeL * finneganswakeL
firstloveH <- modernism$Possessive.wh.pronoun[76] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
firstlove <- firstloveH * firstloveH
fizzlesL <- modernism$Verb.past.tense[79] - as.numeric(summary(modernism$Verb.past.tense)[2])
fizzlesH <- modernism$Adverb[79] - as.numeric(summary(modernism$Adverb)[5])
fizzles <- (fizzlesL * fizzlesL) + (fizzlesH * fizzlesH)
forwhomthebelltollsH <- modernism$Existential[4] - as.numeric(summary(modernism$Existential)[5])
forwhomthebelltolls <- forwhomthebelltollsH * forwhomthebelltollsH
thefugitiveH1 <- modernism$Possessive.wh.pronoun[24] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
thefugitiveH <- modernism$wh.determiner[24] - as.numeric(summary(modernism$wh.determiner)[5])
thefugitive <- (thefugitiveH1 * thefugitiveH1) + (thefugitiveH * thefugitiveH)
theguermanteswayH <- modernism$wh.determiner[21] - as.numeric(summary(modernism$wh.determiner)[5])
theguermanteswayH1 <- modernism$Possessive.wh.pronoun[21] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
theguermantesway <- (theguermanteswayH * theguermanteswayH) + (theguermanteswayH1 * theguermanteswayH1)
howitisL <- modernism$Verb.past.tense[72] - as.numeric(summary(modernism$Verb.past.tense)[2])
howitisH2 <- modernism$Verb.non.3rd.person.singular.present[72] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
howitisH1 <- modernism$Adverb[72] - as.numeric(summary(modernism$Adverb)[5])
howitis <- (howitisL * howitisL) + (howitisH2 * howitisH2) + (howitisH1 * howitisH1)
illseenillsaidL <- modernism$Verb.past.tense[83] - as.numeric(summary(modernism$Verb.past.tense)[2])
illseenillsaidH <- modernism$Adverb[83] - as.numeric(summary(modernism$Adverb)[5])
illseenillsaid <- (illseenillsaidL * illseenillsaidL) + (illseenillsaidH * illseenillsaidH)
inabuddinggroveH <- modernism$wh.determiner[20] - as.numeric(summary(modernism$wh.determiner)[5])
inabuddinggroveH1 <- modernism$Possessive.wh.pronoun[20] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
inabuddinggrove <- (inabuddinggroveH * inabuddinggroveH) + (inabuddinggroveH1 * inabuddinggroveH1)
intruderinthedustH <- modernism$Adverb[104] - as.numeric(summary(modernism$Adverb)[5])
intruderinthedust <- intruderinthedustH * intruderinthedustH
theivorytowerH1 <- modernism$Adjective.Superlative[207] - as.numeric(summary(modernism$Adjective.Superlative)[5])
theivorytowerH <- modernism$wh.determiner[207] - as.numeric(summary(modernism$wh.determiner)[5])
theivorytower <- (theivorytowerH1 * theivorytowerH1) + (theivorytowerH * theivorytowerH)
kafkaH <- modernism$Adverb[17] - as.numeric(summary(modernism$Adverb)[5])
kafka <- (kafkaH * kafkaH)
labyrinthsH <- modernism$Possessive.wh.pronoun[18] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
labyrinths <- labyrinthsH * labyrinthsH
makingofamericansH2 <- modernism$Existential[67] - as.numeric(summary(modernism$Existential)[5])
makingofamericansH1 <- modernism$Adverb[67] - as.numeric(summary(modernism$Adverb)[5])
makingofamericans <- (makingofamericansH2 * makingofamericansH2) + (makingofamericansH1 * makingofamericansH1)
malonediesL <- modernism$Verb.past.tense[87] - as.numeric(summary(modernism$Verb.past.tense)[2])
malonedies <- malonediesL * malonediesL
themansionH1 <- modernism$Adjective.Superlative[111] - as.numeric(summary(modernism$Adjective.Superlative)[5])
themansionH <- modernism$Adverb[111] - as.numeric(summary(modernism$Adverb)[5])
themansion <- (themansionH1 * themansionH1) + (themansionH * themansionH)
thenatureofacrimeL <- modernism$Verb.past.tense[125] - as.numeric(summary(modernism$Verb.past.tense)[2])
thenatureofacrimeH <- modernism$Verb.non.3rd.person.singular.present[125] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
thenatureofacrime <- (thenatureofacrimeL * thenatureofacrimeL) + (thenatureofacrimeH * thenatureofacrimeH)
theoutcryH <- modernism$Adjective.Superlative[205] - as.numeric(summary(modernism$Adjective.Superlative)[5])
theoutcry <- theoutcryH * theoutcryH
theprisonerH1 <- modernism$Possessive.wh.pronoun[23] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
theprisonerH <- modernism$wh.determiner[23] - as.numeric(summary(modernism$wh.determiner)[5])
theprisoner <- (theprisonerH * theprisonerH) + (theprisonerH1 * theprisonerH1)
requiemforanunL <- modernism$Verb.past.tense[107] - as.numeric(summary(modernism$Verb.past.tense)[2])
requiemforanun <- requiemforanunL * requiemforanunL
thesecretagentH1 <- modernism$Possessive.wh.pronoun[160] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
thesecretagent <- thesecretagentH1 * thesecretagentH1
thesenseofthepastH1 <- modernism$Adjective.Superlative[208] - as.numeric(summary(modernism$Adjective.Superlative)[5])
thesenseofthepastH <- modernism$wh.determiner[208] - as.numeric(summary(modernism$wh.determiner)[5])
thesenseofthepast <- (thesenseofthepastH1 * thesenseofthepastH1) + (thesenseofthepastH * thesenseofthepastH)
sodomandgomorrahH <- modernism$wh.determiner[22] - as.numeric(summary(modernism$wh.determiner)[5])
sodomandgomorrahH1 <- modernism$Possessive.wh.pronoun[22] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
sodomandgomorrah <- (sodomandgomorrahH1 * sodomandgomorrahH1) + (sodomandgomorrahH * sodomandgomorrahH)
thesoundandthefuryH <- modernism$Verb.non.3rd.person.singular.present[113] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
thesoundandthefury <- (thesoundandthefuryH * thesoundandthefuryH)
stirringsstillH1 <- modernism$Adverb[81] - as.numeric(summary(modernism$Adverb)[5])
stirringsstillH2 <- modernism$Preposition[81] - as.numeric(summary(modernism$Preposition)[5])
stirringsstill <- (stirringsstillH1 * stirringsstillH1) + (stirringsstillH2 * stirringsstillH2)
swannswayH <- modernism$wh.determiner[19] - as.numeric(summary(modernism$wh.determiner)[5])
swannswayH1 <- modernism$Possessive.wh.pronoun[19] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
swannsway <- (swannswayH * swannswayH) + (swannswayH1 * swannswayH1)
thesunalsorisesH <- max(modernism$Particle) - as.numeric(summary(modernism$Particle)[5])
thesunalsorises <- thesunalsorisesH * thesunalsorisesH
tenderbuttonsL1 <- modernism$Personal.Pronoun[65] - as.numeric(summary(modernism$Personal.Pronoun)[2])
tenderbuttonsL2 <- modernism$Verb.past.tense[65] - as.numeric(summary(modernism$Verb.past.tense)[2])
tenderbuttonsH1 <- modernism$wh.determiner[65] - as.numeric(summary(modernism$wh.determiner)[5])
tenderbuttonsH2 <- modernism$Existential[65] - as.numeric(summary(modernism$Existential)[5])
tenderbuttonsL3 <- modernism$Preposition[65] - as.numeric(summary(modernism$Preposition)[2])
tenderbuttons <- (tenderbuttonsL3 * tenderbuttonsL3) + (tenderbuttonsL1 * tenderbuttonsL1) + (tenderbuttonsL2 * tenderbuttonsL2) + (tenderbuttonsH1 * tenderbuttonsH1) + (tenderbuttonsH2 * tenderbuttonsH2)
textsfornothingL <- modernism$Verb.past.tense[78] - as.numeric(summary(modernism$Verb.past.tense)[2])
textsfornothingH <- modernism$Possessive.wh.pronoun[78] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
textsfornothingH2 <- modernism$Existential[78] - as.numeric(summary(modernism$Existential)[5])
textsfornothingH1 <- modernism$Verb.non.3rd.person.singular.present[78] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
textsfornothing <- (textsfornothingL * textsfornothingL) + (textsfornothingH * textsfornothingH) + (textsfornothingH2 * textsfornothingH2) + (textsfornothingH1 * textsfornothingH1)
threelivesH <- modernism$Adverb[64] - as.numeric(summary(modernism$Adverb)[5])
threelives <- threelivesH * threelivesH
timeregainedH <- modernism$wh.determiner[25] - as.numeric(summary(modernism$wh.determiner)[5])
timeregainedH1 <- modernism$Possessive.wh.pronoun[25] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
timeregained <- (timeregainedH * timeregainedH) + (timeregainedH1 * timeregainedH1)
ulyssesL <- modernism$Verb.past.tense[59] - as.numeric(summary(modernism$Verb.past.tense)[2])
ulysses <- ulyssesL * ulyssesL
theunnamableL <- modernism$Verb.past.tense[71] - as.numeric(summary(modernism$Verb.past.tense)[2])
theunnamableH2 <- modernism$Existential[71] - as.numeric(summary(modernism$Existential)[5])
theunnamableH1 <- modernism$Verb.non.3rd.person.singular.present[71] - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
theunnamable <- (theunnamableL * theunnamableL) + (theunnamableH2 * theunnamableH2) + (theunnamableH1 * theunnamableH1)
thevalleyofdecisionH <- modernism$Possessive.wh.pronoun[217] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
thevalleyofdecision <- thevalleyofdecisionH * thevalleyofdecisionH
wattH <- modernism$Possessive.wh.pronoun[69] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
watt <- wattH * wattH
thewavesL <- modernism$Verb.past.tense[92] - as.numeric(summary(modernism$Verb.past.tense)[2])
thewavesH <- max(modernism$Verb.non.3rd.person.singular.present) - as.numeric(summary(modernism$Verb.non.3rd.person.singular.present)[5])
thewaves <- (thewavesL * thewavesL) + (thewavesH * thewavesH)
worstwardhoL <- modernism$Personal.Pronoun[82] - as.numeric(summary(modernism$Personal.Pronoun)[2])
worstwardhoL1 <- modernism$Verb.past.tense[82] - as.numeric(summary(modernism$Verb.past.tense)[2])
worstwardhoH1 <- max(modernism$Adjective.Superlative) - as.numeric(summary(modernism$Adjective.Superlative)[5])
worstwardhoH2 <- modernism$Possessive.wh.pronoun[82] - as.numeric(summary(modernism$Possessive.wh.pronoun)[5])
worstwardhoH <- modernism$Adverb[82] - as.numeric(summary(modernism$Adverb)[5])
worstwardho <- (worstwardhoL * worstwardhoL) + (worstwardhoL1 * worstwardhoL1) + (worstwardhoH1 * worstwardhoH1) + (worstwardhoH2 * worstwardhoH2) + (worstwardhoH * worstwardhoH)
theyearsH <- modernism$Verb.past.tense[90] - as.numeric(summary(modernism$Verb.past.tense)[5])
theyears <- theyearsH * theyearsH
modernismresiduals <- c(absalom, acrosstheriverandintothetrees, asilaydying, theautobiographyofalicebtoklas, beckettstories, company, everybodysautobiography, afable, finneganswake, firstlove, fizzles, forwhomthebelltolls, thefugitive, theguermantesway, howitis, illseenillsaid, inabuddinggrove, intruderinthedust, theivorytower, kafka, labyrinths, makingofamericans, malonedies, themansion, thenatureofacrime, theoutcry, theprisoner, requiemforanun, thesecretagent, thesenseofthepast, sodomandgomorrah, thesoundandthefury, stirringsstill, swannsway, thesunalsorises, tenderbuttons, textsfornothing, threelives, timeregained, ulysses, theunnamable, thevalleyofdecision, watt, thewaves, worstwardho, theyears)
modernismtitles <- c("Absalom, Absalom!", "Across the River and into the Trees", "As I Lay Dying", "The Autobiography of Alice B. Toklas", "Beckett Stories", "Company", "Everybody's Autobiography", "A Fable", "Finnegans Wake", "First Love", "Fizzles", "For Whom the Bell Tolls", "The Fugitive", "The Guermantes Way", "How It Is", "Ill Seen Ill Said", "In a Budding Grove", "Intruder in the Dust", "The Ivory Tower", "Kafka", "Labyrinths", "The Making of Americans", "Malone Dies", "The Mansion", "The Nature of a Crime", "The Outcry", "The Prisoner", "Requiem for a Nun", "The Secret Agent", "The Sense of the Past", "Sodom and Gomorrah", "The Sound and the Fury", "Stirrings Still", "Swann's Way", "The Sun Also Rises", "Tender Buttons", "Texts for Nothing", "Three Lives", "Time Regained", "Ulysses", "The Unnamable", "The Valley of Decision", "Watt", "The Waves", "Worstward Ho", "The Years")
modernismresiduals <- cbind.data.frame(modernismtitles, modernismresiduals)
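The per-title calculations above all follow one pattern: for each flagged feature, take the text's value, subtract the first quartile (for low outliers) or the third quartile (for high outliers), square the difference, and sum across features. What follows is a minimal sketch of that pattern as a reusable function, not the code used in the analysis. The name `quartile_divergence` and the toy data are hypothetical, and `quantile()` stands in for `summary(...)[2]` and `summary(...)[5]`; the two should agree up to any rounding that `summary()` applies.

```r
# Hypothetical helper: sum of squared distances from the quartiles.
# `low` names features where the text falls below the first quartile,
# `high` names features where it falls above the third quartile.
quartile_divergence <- function(data, row, low = character(0), high = character(0)) {
  low_part <- vapply(low, function(f) {
    (data[[f]][row] - unname(quantile(data[[f]], 0.25)))^2
  }, numeric(1))
  high_part <- vapply(high, function(f) {
    (data[[f]][row] - unname(quantile(data[[f]], 0.75)))^2
  }, numeric(1))
  sum(low_part, high_part)
}

# Toy data standing in for the real feature table (values are made up).
toy <- data.frame(
  Verb.past.tense = c(1, 2, 3, 4, 5),
  Existential = c(10, 20, 30, 40, 50)
)

# Row 1 is low on past-tense verbs; row 5 is high on existentials.
quartile_divergence(toy, 1, low = "Verb.past.tense")
quartile_divergence(toy, 5, high = "Existential")
```

With the real data, a call such as `quartile_divergence(modernism, 59, low = "Verb.past.tense")` would correspond to the `ulysses` calculation, assuming the row indices used above.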
faulkner <- sum(absalom, asilaydying, afable, intruderinthedust, themansion, requiemforanun, thesoundandthefury)
hemingway <- sum(acrosstheriverandintothetrees, forwhomthebelltolls, thesunalsorises)
stein <- sum(theautobiographyofalicebtoklas, everybodysautobiography, makingofamericans, tenderbuttons, threelives)
beckett <- sum(beckettstories, company, firstlove, fizzles, howitis, illseenillsaid, malonedies, stirringsstill, textsfornothing, theunnamable, watt, worstwardho)
joyce <- sum(finneganswake, ulysses)
proust <- sum(thefugitive, theguermantesway, inabuddinggrove, theprisoner, sodomandgomorrah, swannsway, timeregained)
james <- sum(theivorytower, theoutcry, thesenseofthepast)
conrad <- sum(thenatureofacrime, thesecretagent)
wharton <- thevalleyofdecision
woolf <- sum(thewaves, theyears)
modernistwriters <- c((faulkner / 7), (hemingway / 3), (stein / 5), (beckett / 12), (joyce / 2), (proust / 7), (james / 3), (conrad / 2), wharton, (woolf / 2), kafka, labyrinths)
modernistnames <- c("Faulkner", "Hemingway", "Stein", "Beckett", "Joyce", "Proust", "James", "Conrad", "Wharton", "Woolf", "Kafka", "Borges")
modernistwriters <- cbind.data.frame(modernistnames, modernistwriters)
plot(modernistwriters, main = "Mapping Outliers", xlab = "Names", ylab = "Grammatical Divergence")
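The author-level scores above divide each author's summed divergence by the number of flagged titles, with the divisors typed in by hand. The sketch below shows how the same averages could be derived with `tapply()`, assuming a per-title table with an author column. That table is hypothetical (the original keeps titles and scores in separate vectors), and its scores are placeholders rather than results.

```r
# Hypothetical per-title table; the divergence values are placeholders.
title_scores <- data.frame(
  author = c("Joyce", "Joyce", "Woolf", "Woolf", "Wharton"),
  title = c("Finnegans Wake", "Ulysses", "The Waves", "The Years",
            "The Valley of Decision"),
  divergence = c(4, 2, 3, 1, 5)
)

# Mean divergence per author: the same as sum(...) / number of titles,
# but the divisor is derived from the data instead of being typed in.
author_means <- tapply(title_scores$divergence, title_scores$author, mean)
author_means
```

Deriving the divisor from the data avoids a mismatch between the titles listed in a `sum()` call and the number used to average them.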
realismtitles <- c("Balzac", "Eliot", "Zola", "Dostoevsky", "Crane", "Thackeray", "Austen")
balzac <- sum(anepisodeundertheterror, theballatsceaux, lettersoftwobrides, seraphita) / 4
eliot <- sum(theattackonthemill)
zola <- sum(thedeadwomanswish, hismasterpiece) / 2
dostoevsky <- sum(theinsultedandhumiliated, poorfolk, unclesdream, villageofstepanchikovo, whitenights) / 5
crane <- sum(thelittleregiment, theredbadgeofcourage, thethirdviolet) / 3
thackeray <- sum(lovelthewidower)
austen <- sum(mansfieldpark)
realismnovelists <- c(balzac, eliot, zola, dostoevsky, crane, thackeray, austen)
realism <- cbind.data.frame(realismtitles, realismnovelists)
plot(realism, main = "Grammatical Divergence amongst Realist Novelists", xlab = "Names", ylab = "Divergence (%)")
centuries %>% group_by(Year) %>%
summarise(sentence_summary = mean(Slength)) -> sentence_summary
sentence_summary %>% dygraph(main = "Sentence Lengths 1794 - 1988", ylab = "Sentence Length (Words)", xlab = "Year") %>% dyRangeSelector()
correlationdata <- centuries[, c(32, 15, 23, 17, 6, 28, 20, 26, 5, 30, 9)]
colnames(correlationdata) <- c("Sentence Length", "Personal Pronouns", "Verb Past Tense", "Adverb", "Prepositions", "Wh Determiners", "Particles", "Verb 3rd Person Singular Present", "Existentials", "Possessive wh Pronouns", "Adjective Superlatives")
res <- cor(correlationdata)
round(res, 2)
##                                  Sentence Length Personal Pronouns
## Sentence Length                             1.00             -0.29
## Personal Pronouns                          -0.29              1.00
## Verb Past Tense                            -0.30              0.54
## Adverb                                     -0.06              0.38
## Prepositions                                0.49             -0.39
## Wh Determiners                              0.66             -0.40
## Particles                                  -0.44              0.31
## Verb 3rd Person Singular Present           -0.13              0.05
## Existentials                               -0.13              0.15
## Possessive wh Pronouns                      0.44             -0.50
## Adjective Superlatives                      0.36             -0.26
##                                  Verb Past Tense Adverb Prepositions
## Sentence Length                            -0.30  -0.06         0.49
## Personal Pronouns                           0.54   0.38        -0.39
## Verb Past Tense                             1.00   0.02        -0.24
## Adverb                                      0.02   1.00        -0.10
## Prepositions                               -0.24  -0.10         1.00
## Wh Determiners                             -0.38  -0.26         0.58
## Particles                                   0.52   0.12        -0.33
## Verb 3rd Person Singular Present           -0.56   0.06        -0.36
## Existentials                                0.05   0.29        -0.23
## Possessive wh Pronouns                     -0.35  -0.31         0.41
## Adjective Superlatives                     -0.43   0.16         0.24
##                                  Wh Determiners Particles
## Sentence Length                            0.66     -0.44
## Personal Pronouns                         -0.40      0.31
## Verb Past Tense                           -0.38      0.52
## Adverb                                    -0.26      0.12
## Prepositions                               0.58     -0.33
## Wh Determiners                             1.00     -0.62
## Particles                                 -0.62      1.00
## Verb 3rd Person Singular Present          -0.09     -0.19
## Existentials                              -0.15      0.15
## Possessive wh Pronouns                     0.70     -0.51
## Adjective Superlatives                     0.25     -0.35
##                                  Verb 3rd Person Singular Present
## Sentence Length                                             -0.13
## Personal Pronouns                                            0.05
## Verb Past Tense                                             -0.56
## Adverb                                                       0.06
## Prepositions                                                -0.36
## Wh Determiners                                              -0.09
## Particles                                                   -0.19
## Verb 3rd Person Singular Present                             1.00
## Existentials                                                 0.01
## Possessive wh Pronouns                                      -0.07
## Adjective Superlatives                                       0.09
##                                  Existentials Possessive wh Pronouns
## Sentence Length                         -0.13                   0.44
## Personal Pronouns                        0.15                  -0.50
## Verb Past Tense                          0.05                  -0.35
## Adverb                                   0.29                  -0.31
## Prepositions                            -0.23                   0.41
## Wh Determiners                          -0.15                   0.70
## Particles                                0.15                  -0.51
## Verb 3rd Person Singular Present         0.01                  -0.07
## Existentials                             1.00                  -0.31
## Possessive wh Pronouns                  -0.31                   1.00
## Adjective Superlatives                  -0.10                   0.22
##                                  Adjective Superlatives
## Sentence Length                                    0.36
## Personal Pronouns                                 -0.26
## Verb Past Tense                                   -0.43
## Adverb                                             0.16
## Prepositions                                       0.24
## Wh Determiners                                     0.25
## Particles                                         -0.35
## Verb 3rd Person Singular Present                   0.09
## Existentials                                      -0.10
## Possessive wh Pronouns                             0.22
## Adjective Superlatives                             1.00
# Wh-determiners and sentence length are positively correlated (r = 0.66); possessive wh-pronouns and wh-determiners are also positively correlated (r = 0.70).
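The comment above singles out the strongest pairs by reading the matrix. The sketch below shows one way such pairs could be ranked programmatically. It runs on a small made-up data frame rather than on `correlationdata`, and the names `toy_features`, `res_toy` and `pair_table` are hypothetical.

```r
# Toy data standing in for `correlationdata` (values are made up).
toy_features <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 11),
  c = c(5, 3, 4, 1, 2)
)
res_toy <- cor(toy_features)

# Keep each pair once by blanking the diagonal and the lower triangle.
res_toy[lower.tri(res_toy, diag = TRUE)] <- NA

# Reshape the matrix into one row per pair and drop the blanked cells.
pair_table <- as.data.frame(as.table(res_toy), stringsAsFactors = FALSE)
pair_table <- pair_table[!is.na(pair_table$Freq), ]
names(pair_table) <- c("feature_1", "feature_2", "r")

# Order by absolute strength, strongest first.
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Applied to `res`, the same steps would list every feature pair once, so that strong negative correlations are as visible as strong positive ones.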