Imagine that your philosophy homework consists of reading more than a hundred books and grouping them according to their affinities. Although this is a really hard assignment, there are several ways to do it. One is to group by author: we put all of Plato's books in one group, all of Aristotle's books in another, and so on, but this approach does not give much relevant information. Another option is to group by theme, such as ethics, metaphysics, or aesthetics. This approach is much more valuable, but some books deal with several topics. A third option is to group the books according to the importance of the words in them, and that is the approach covered in this post.
Twenty-two philosophy authors whose work is available in Project Gutenberg were selected, namely: Aristotle, Augustine, Berkeley, Descartes, Hegel, Hobbes, Hume, Kant, Leibniz, Locke, Machiavelli, Marcus Aurelius, Marx, Mill, Nietzsche, Pascal, Plato, Rousseau, Russell, Schopenhauer, Spinoza and Thomas. All of their books available through the gutenberg package were downloaded, resulting in 177 books. Here is a sample:
| Author | Title |
|---|---|
| Augustine | The City of God, Volume II |
| Schopenhauer | The Essays of Arthur Schopenhauer; On Human Nature |
| Spinoza | Ethics |
| Nietzsche | Thoughts out of Season, Part I |
| Plato | Ion |
| Mill | Considerations on Representative Government |
| Machiavelli | Discourses on the First Decade of Titus Livius |
| Mill | Auguste Comte and Positivism |
| Plato | Euthyphro |
| Aristotle | The Ethics of Aristotle |
After downloading the data, we need to preprocess it so that noise in the text does not mislead the analysis. The following steps were applied:

- Stop word removal
- Special character removal
- Lowercasing
- Contraction replacement
- Punctuation removal
- Lemmatization

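To make the preprocessing steps listed above concrete, here is a minimal Python sketch. It is not the code used in the study: the function name `preprocess` and the tiny `STOP_WORDS`, `CONTRACTIONS`, and `LEMMAS` tables are hypothetical stand-ins, and a real run would rely on full stop word, contraction, and lemma resources.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables used only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "not", "to", "in"}
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "it's": "it is"}
LEMMAS = {"virtues": "virtue", "souls": "soul", "books": "book"}


def preprocess(text):
    """Return the list of cleaned tokens for one document."""
    text = text.lower()                        # lowercasing
    text = text.replace("\u2019", "'")         # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():   # contraction replacement
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # special character removal: keep only letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    # stop word removal
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # lemmatization, here reduced to a dictionary lookup
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("The soul isn't mortal; Virtues & souls!"))
```

The order matters in this sketch: contractions are expanded before punctuation is stripped, otherwise the apostrophes they depend on would already be gone.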
The data was then transformed into a corpus, and then into a matrix weighted by term frequency. This matrix has hundreds of columns, so dimensionality reduction was applied. First, principal component analysis was used to capture at least 80% of the variance, a threshold reached with 82 components. Then the t-SNE algorithm was applied; the number of steps that minimizes the K-L divergence was estimated to be 1250.
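The idea of keeping just enough principal components to reach a variance threshold can be sketched as follows. This is an illustration under assumptions, not the study's code: it uses NumPy, the function name `pca_reduce` is hypothetical, and the random matrix is only a toy stand-in for the real term-frequency matrix.

```python
import numpy as np


def pca_reduce(X, threshold=0.80):
    """Project X onto the fewest principal components whose cumulative
    explained variance reaches `threshold`. Returns (scores, k)."""
    Xc = X - X.mean(axis=0)                            # center each term column
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    ratio = s**2 / np.sum(s**2)                        # explained variance ratio
    k = int(np.searchsorted(np.cumsum(ratio), threshold)) + 1
    k = min(k, len(s))                                 # guard against rounding
    return Xc @ Vt[:k].T, k


# Toy stand-in for the term-frequency matrix: 20 "books" x 50 "terms".
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(20, 50)).astype(float)
scores, k = pca_reduce(X, threshold=0.80)
print(scores.shape, k)
```

On the real matrix, the same cumulative-variance rule is what yields the 82 components mentioned above; the reduced `scores` would then be the input to t-SNE.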
For t-SNE, perplexity values from 1 to 58 were tried, and the value that makes the clusters most visible is 2.
Perplexity = 2
In the final step of this analysis, k-means clustering was applied in the two-dimensional t-SNE space, and the number of clusters was set to 12 using the elbow method.
Kmeans
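For readers unfamiliar with the elbow method, the sketch below shows one way it can be computed. It is an assumption-laden illustration rather than the study's implementation: it uses NumPy with a plain Lloyd's algorithm, the names `nearest`, `kmeans`, and `elbow_curve` are hypothetical, and the three synthetic blobs stand in for the real two-dimensional t-SNE coordinates.

```python
import numpy as np


def nearest(points, centers):
    """Index of the closest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centers, inertia)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest(points, centers)
        # Move each center to the mean of its cluster; keep empty ones in place.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = nearest(points, centers)
    inertia = float(((points - centers[labels]) ** 2).sum())
    return labels, centers, inertia


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1 .. max_k."""
    return [kmeans(points, k)[2] for k in range(1, max_k + 1)]


# Toy stand-in for the t-SNE coordinates: three synthetic 2-D blobs.
rng = np.random.default_rng(1)
blobs = [np.array(c) + rng.normal(scale=0.5, size=(20, 2))
         for c in ([0, 0], [10, 0], [0, 10])]
points = np.vstack(blobs)
for k, inertia in enumerate(elbow_curve(points, 6), start=1):
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which the printed within-cluster sum of squares stops dropping sharply. Reading it off the curve is a judgment call, which is how a value such as 12 would be chosen for the book embedding.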
If you want more details about the study, or want to reproduce it yourself, check out the GitHub repository (https://github.com/AraujoAMF/philosophy-clustering).
After applying all these techniques, the final result is this plot, which places each book in the t-SNE space. Each color is a cluster, so points with the same color belong to the same cluster. You can see the author of each book, and if you click on a point, the book title is shown. Use it to explore the relationships and similarities between the books yourself.