Introduction

Imagine you have a philosophy assignment that consists of reading more than a hundred books and grouping them according to their affinities. Although this is a really hard assignment, there are several ways to do it. One is by author: we put all of Plato's books in one group, all of Aristotle's books in another, and so on, but this approach does not give much relevant information. Another option is to group by theme, such as ethics, metaphysics, or aesthetics. This approach is much more valuable, but some books deal with several topics. A third option is to group the books according to the importance of the words in them, and that is the approach covered in this post.

Methodology

We selected 22 philosophy authors whose work is available in Project Gutenberg, namely: Aristotle, Augustine, Berkeley, Descartes, Hegel, Hobbes, Hume, Kant, Leibniz, Locke, Machiavelli, Marcus Aurelius, Marx, Mill, Nietzsche, Pascal, Plato, Rousseau, Russell, Schopenhauer, Spinoza and Thomas. All of their books available through the gutenberg package were downloaded, resulting in 177 books. Here is a sample:

Author         Title
Augustine      The City of God, Volume II
Schopenhauer   The Essays of Arthur Schopenhauer; On Human Nature
Spinoza        Ethics
Nietzsche      Thoughts out of Season, Part I
Plato          Ion
Mill           Considerations on Representative Government
Machiavelli    Discourses on the First Decade of Titus Livius
Mill           Auguste Comte and Positivism
Plato          Euthyphro
Aristotle      The Ethics of Aristotle

After downloading the data, we need to preprocess it to avoid misleading results in the analysis. In this step the following was done:

The data was then transformed into a corpus, and the corpus into a matrix weighted by term frequency. This matrix has hundreds of columns, so a dimensionality reduction technique was applied. First, a principal component analysis was applied to capture at least 80% of the variance, a threshold that is reached with 82 components. Then the t-SNE algorithm was applied, and the number of steps that minimizes the K-L divergence was estimated to be 1250.
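The first two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used in the study: the function names `term_frequency_matrix` and `components_for_variance` are hypothetical, tokenization is a naive whitespace split, and "weighted by term frequency" is assumed to mean raw word counts.

```python
from collections import Counter


def term_frequency_matrix(docs):
    """Build a document-term matrix of raw word counts.

    Assumption: tokens are lowercase, whitespace-separated words.
    Returns the sorted vocabulary and one row of counts per document.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def components_for_variance(explained_variance, target=0.80):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches `target`.

    Assumption: `explained_variance` lists the variance of each
    principal component in decreasing order, as PCA reports it.
    """
    total = sum(explained_variance)
    running = 0.0
    for n, variance in enumerate(explained_variance, start=1):
        running += variance
        if running / total >= target:
            return n
    return len(explained_variance)


# Toy usage with two made-up "books".
vocab, matrix = term_frequency_matrix(
    ["virtue is knowledge", "knowledge is power is power"]
)
n_components = components_for_variance([6.0, 3.0, 1.0], target=0.80)
```

In the study itself, the same cumulative-variance rule applied to the real matrix is what gives the 82 components mentioned above.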

Itercosts

Perplexity values from 1 to 58 were then tried, and the value that makes the clusters most visible is 2.

Perplexity = 2

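To make the two quantities tuned above more concrete, here is a minimal sketch of how they are defined. In t-SNE, the perplexity of a neighbor distribution is 2 raised to its Shannon entropy, which can be read as an effective number of neighbors, and the cost being minimized is the K-L divergence between the high-dimensional and low-dimensional similarity distributions. The helper names `perplexity` and `kl_divergence` are hypothetical, and the distributions are toy values, not data from the study.

```python
import math


def perplexity(p):
    """Perplexity of a discrete distribution: 2 ** (Shannon entropy in bits)."""
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    return 2 ** entropy


def kl_divergence(p, q):
    """K-L divergence KL(P || Q) in nats.

    Assumption: p and q have the same length, both sum to 1, and
    q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# A point that spreads its similarity evenly over two neighbors
# has perplexity 2, i.e. two effective neighbors.
two_neighbors = perplexity([0.5, 0.5])

# The divergence is zero when the distributions match and grows
# as the low-dimensional similarities drift away.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

Read this way, a perplexity of 2 means that each book's position is driven mostly by its two or so closest neighbors, which favors small, tight groups in the plot.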

In the final step of this analysis, k-means clustering was applied in the two-dimensional t-SNE space, and the number of clusters was set to 12 using the elbow method.

Kmeans

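The elbow method can be illustrated with a small, dependency-free sketch. It is not the implementation used in the study: the initialization is a deterministic farthest-first rule chosen here so the example is reproducible, the points are toy 2-D data rather than the t-SNE output, and `pick_elbow` is one simple heuristic (largest drop in improvement) standing in for reading the elbow off a plot. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic init (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from the
    centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def pick_elbow(inertias):
    """Heuristic: the k after which the improvement drops the most.
    Assumes the keys are consecutive integers."""
    ks = sorted(inertias)
    return max(
        ks[1:-1],
        key=lambda k: (inertias[k - 1] - inertias[k]) - (inertias[k] - inertias[k + 1]),
    )


# Toy data: three well-separated 2x2 squares of points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
elbow = pick_elbow(inertias)
```

On this toy data the within-cluster sum of squares falls sharply up to three clusters and barely changes afterwards, so the heuristic picks three. The same kind of curve, computed on the t-SNE coordinates, is what suggested 12 clusters in the study.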

If you want more details about the study, or want to reproduce it yourself, check out the GitHub repository (https://github.com/AraujoAMF/philosophy-clustering).

Result

After applying all these techniques, the final result is a plot that groups each book in the t-SNE space, and you can explore the relationships between the books yourself. Each color is a cluster, so points with the same color belong to the same cluster. Each point shows the author of the book, and clicking on a point shows the book title, so you can explore the similarities between these books.