Evolution of language in literature throughout the ages
The goal of our project is to compare how the English language evolved over four consecutive centuries, from the 17th to the 20th. To this end, we collected books from each century from Project Gutenberg. Not all of these books were originally written in English: some were written in other languages and then translated into English. We believe that, even so, their translations reflect the English language of the corresponding period.
The initial idea was to collect at least a dozen titles for each century, based on the lists of the most popular books for each period on www.goodreads.com. However, for the 20th century we ran into problems with the availability of titles because of copyright. That is why the books from the 20th century are not the most popular ones, only those we could obtain.
First look at the datasets - wordclouds
To create the wordclouds, we used the R library wordcloud2. We selected the words that occur at least 100 times and plotted them as a wordcloud to give a quick impression of how the language evolved through the ages.
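The frequency threshold described above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for the plots: the regular-expression tokenizer and the function name `frequent_words` are assumptions, and any stop-word removal or lemmatisation applied in the project is left out.

```python
import re
from collections import Counter

def frequent_words(text, min_count=100):
    """Return {word: count} for words occurring at least min_count times."""
    # Assumption: lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Keep only words that reach the threshold, as done before plotting.
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy input: three words appear 120 times each, one word only 99 times.
sample = "thou shall go " * 120 + "say " * 99
print(frequent_words(sample))
```

The resulting dictionary of words and counts is the kind of table a wordcloud tool takes as input.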
17th century
18th century
19th century
20th century
We can see that many of the popular words of the 17th century are also among the most popular words in later centuries, e.g. ‘say’, ‘go’, ‘see’, ‘will’. However, some 17th-century words now seem outdated, like ‘thou’, ‘shall’ or ‘thy’. On the other hand, some words gained popularity over time: ‘like’, for example, was one of the most popular words in the 20th century but was used less frequently before then.
Word frequencies
To confirm our conclusions, we plotted word frequencies. We can clearly see that, compared to the 20th century, words such as ‘thee’, ‘knight’ and ‘christ’ were used more commonly in the 17th century. Conversely, the words ‘tea’, ‘archer’, ‘mrs’ and ‘miss’ were much more common in the 20th century. Interestingly, the most popular words of the 17th century, such as ‘go’, ‘say’ and ‘will’, were also the most popular ones in the 20th century.
Similarly, when we compare the 18th and 19th centuries to the 20th century, we can see that some words became less popular (like ‘clergy’, ‘army’, ‘cristo’) and some gained popularity (e.g. ‘archer’, ‘businessman’, ‘beaufort’), but the most popular words stayed the same (‘come’, ‘always’, ‘return’).
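A comparison like this rests on relative frequencies, so that corpora of different sizes can be set side by side. The sketch below is one possible way to compute them in Python; the function names and the toy token lists are illustrative assumptions, not the project's actual code or data.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def compare(tokens_a, tokens_b):
    """Return {word: (share in corpus A, share in corpus B)} for every word seen."""
    freq_a = relative_frequencies(tokens_a)
    freq_b = relative_frequencies(tokens_b)
    return {w: (freq_a.get(w, 0.0), freq_b.get(w, 0.0)) for w in set(freq_a) | set(freq_b)}

# Toy corpora standing in for two centuries (hypothetical data).
century_17 = ["thee", "go", "say", "thee"]
century_20 = ["tea", "go", "say", "go"]
for word, shares in sorted(compare(century_17, century_20).items()):
    print(word, shares)
```

Words whose two shares are close are equally common in both periods; words with one share near zero are characteristic of the other period.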
Total Number of Words vs Count of Unique Words for each Century
The 18th century has the highest total number of words, and it also has the highest number of unique words. However, because of the limited size of our dataset, this conclusion cannot be regarded as fully reliable.
Count of Unique Words to All Words
The proportion of unique words to all words for each century is closely connected with the previous plot. The centuries with the lowest total word counts are the ones with the highest proportion. This points to an interesting, if somewhat obvious, pattern that probably holds for all languages: the number of unique words in a language is limited. Beyond some point, adding more text to the dataset adds few or no new unique words, so the number of unique words stays roughly the same while the proportion goes down.
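The effect of dataset size on this proportion can be shown with a small example. The sketch below, in Python, uses an artificial five-word vocabulary (an assumption made purely for illustration) and computes the share of unique words for increasingly long samples of the same text.

```python
def unique_word_ratio(tokens):
    """Share of unique words among all words (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Artificial text: the same five words repeated, 1000 tokens in total.
vocabulary = ["go", "say", "see", "will", "come"]
tokens = [vocabulary[i % len(vocabulary)] for i in range(1000)]

# The number of unique words stays at five while the ratio keeps falling.
for size in (10, 100, 1000):
    sample = tokens[:size]
    print(size, len(set(sample)), unique_word_ratio(sample))
```

Real texts keep adding some new words as they grow, so the decline is less extreme than here, but the direction is the same. This is why the ratio is hard to compare across corpora of very different sizes.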
Parts of Speech Analysis
The parts of speech in a text, and the proportions in which they occur, can give us valuable information about its character. Consider, for example, a comparison between the speech of an extrovert and that of an introvert: we might expect the extrovert to use more adjectives. Another example is someone who talks a lot about themselves or about other people, so that a significantly high share of nouns could, in some way, point to a narcissistic personality. Unfortunately, the results of this analysis on our dataset are not strongly differentiated. The most interesting finding is that the 20th century has the highest share in every category except ‘Others’. This might indicate that the Stanford NLP library we used for this task performed best on this part of the dataset because its language is the most modern.
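Once a tagger has labelled each word, the shares per category are a simple count. The Python sketch below assumes the input is already tagged with Penn Treebank-style tags; the grouping into Noun, Verb, Adjective, Adverb and Others is an assumption for illustration and may differ from the categories used in the plot.

```python
from collections import Counter

def coarse_category(tag):
    """Collapse a Penn Treebank-style tag into a coarse category (assumed grouping)."""
    if tag.startswith("NN"):
        return "Noun"
    if tag.startswith("VB"):
        return "Verb"
    if tag.startswith("JJ"):
        return "Adjective"
    if tag.startswith("RB"):
        return "Adverb"
    return "Others"

def pos_shares(tagged_tokens):
    """Return each category's share of all tagged tokens."""
    counts = Counter(coarse_category(tag) for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hand-tagged toy sentence (hypothetical tagger output).
tagged = [("the", "DT"), ("old", "JJ"), ("knight", "NN"), ("rode", "VBD")]
print(pos_shares(tagged))
```

Any word the tagger cannot place in one of the named groups ends up in Others, which is one reason the share of Others can reflect tagger performance as much as the text itself.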
Sentiment analysis
Sentiment Type to Sum of All Words
The conclusion from this plot is similar to the previous one. The 20th century has the highest share in both the positive and the negative classes, and the lowest share in the neutral class. In sentiment analysis, the neutral class is the one most likely to contain items that are not really neutral but are classified as such because the algorithm lacks information about them.
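The point about the neutral class can be seen in a lexicon-based sketch. The Python example below uses a tiny hand-made lexicon (an assumption; the lexicon actually used in the analysis is not stated here) and labels every word missing from it as neutral.

```python
from collections import Counter

# Tiny illustrative lexicon; a real analysis would use a full sentiment lexicon.
TOY_LEXICON = {
    "heaven": "positive",
    "love": "positive",
    "death": "negative",
    "die": "negative",
}

def sentiment_shares(tokens, lexicon):
    """Share of positive, negative and neutral words among all tokens."""
    if not tokens:
        return {}
    # Words not found in the lexicon fall into the neutral class by default.
    labels = Counter(lexicon.get(token, "neutral") for token in tokens)
    return {label: labels[label] / len(tokens) for label in ("positive", "negative", "neutral")}

tokens = ["love", "death", "the", "king", "heaven", "sat", "down", "die"]
print(sentiment_shares(tokens, TOY_LEXICON))
```

Because unknown words default to neutral, a lexicon with better coverage of one period's vocabulary will assign more of that period's words to the positive or negative class.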
Most common positive and negative words
We also analysed how much each word contributed to each sentiment in the 17th and 20th centuries (we are mostly interested in these two centuries as they are the first and the last). We noticed that in the 17th century some of the most common negative words revolved around death (‘death’, ‘die’), but these are not among the 10 most frequently used negative words in the 20th century.
In the case of positive words, we can see that words concerning religion, like ‘heaven’ or ‘god’, were popular in the 17th century but not in the 20th century.
Is there a progressive simplification of the English language?
Ogden’s Basic English is a simplified language created by Charles K. Ogden in 1930. It is based on the theory that 90% of concepts in the English language can be explained using only 850 simple words.
The full list of Ogden’s Basic English words can be found here: http://ogden.basic-english.org/
Use of basic words in literature - analysis of all words
To verify the hypothesis that English has been simplified through the ages, we checked how many of the words used in literature were basic in terms of Ogden’s Basic English. There was a decrease in the usage of basic words in the 18th century compared to the 17th century, followed by an increase in the 19th and 20th centuries. However, there does not seem to be a strong trend.
Use of basic words in literature - analysis of unique words
To make our analysis more thorough, we also checked how many of the unique words used in literature were basic words. The highest percentage of basic words was in the 20th century. However, there is no clear trend, and we cannot draw any conclusions about a simplification of the English language.
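The two measures differ only in whether repeated words are counted. The Python sketch below illustrates both with a five-word subset of the Basic English list and a toy token list; the subset, the function names and the data are assumptions for illustration, and a real run would load all 850 words.

```python
# Small illustrative subset of Ogden's list, not the full 850 words.
BASIC_SUBSET = {"come", "go", "say", "see", "give"}

def basic_share_all(tokens, basic_words):
    """Share of all tokens (repeats included) that are basic words."""
    return sum(token in basic_words for token in tokens) / len(tokens)

def basic_share_unique(tokens, basic_words):
    """Share of distinct words that are basic words."""
    unique = set(tokens)
    return len(unique & basic_words) / len(unique)

tokens = ["go", "go", "say", "knight", "thou", "go"]
print(basic_share_all(tokens, BASIC_SUBSET))     # 4 of 6 tokens are basic
print(basic_share_unique(tokens, BASIC_SUBSET))  # 2 of 4 distinct words are basic
```

The first measure is dominated by a few very frequent words, while the second treats every distinct word equally, which is why the two analyses can rank the centuries differently.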
Additional analysis
Finally, we tried to learn a little more about the authors of the books included in our analysis.