Visualising the Composition of the Human Body in R

Despite my penchant for coding constantly in R, I'm primarily a biochemistry student. Today I'm going to kill two birds with one stone by using R to create a diagram for my notes (which are themselves written in $ \LaTeX $).

The diagram shows the chemical composition (percentage of atoms) of the human body. Besides the more common elements ($ C, H, O, N, P $) there is also a myriad of elements with specialised but essential roles in the body. Selenium, a sulphur-like element often found in wretched-smelling compounds, is present in the oft-ignored 21st amino acid[1] selenocysteine, which, as a constituent of the enzyme deiodinase, helps regulate the metabolic rate. Cobalt is the workhorse of cofactor B12, which acts as a free-radical generator in order to rearrange a certain fat breakdown product. Despite cobalt's minute concentration $ (1:10^{6}) $, a B12 deficiency[2] usually proves fatal within three years.

Some of the elements that are labelled as not having a positive health effect do play some role in human metabolism. For example, rubidium is treated much like the $ K^{+} $ (potassium) ion, and lead is a powerful inhibitor of some blood-forming[3] enzymes. As per the legend title, these are not considered as having a positive health effect; their absence from the body is not missed.

Code Details

I used the readHTMLTable function in the XML package to get the data, which was taken from Wikipedia. Wikipedia has a wide range of well-formatted tables, particularly for statistics by country. In this case, the statistics are slightly anomalous: the proportions sum to slightly over 1. My guess is that either the page authors or I fudged one of the numbers by an order of magnitude. Maybe silicon; I'm surprised that so much silicon is present in the human body (notwithstanding the obvious implant jokes).
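A sanity check like the one that exposed the anomaly can be sketched in a few lines. The vector `proportions` below holds made-up illustrative values, not the Wikipedia figures, and the `tolerance` threshold is an arbitrary assumption.

```r
# Hypothetical fractions of all atoms; NOT the values from the Wikipedia table
proportions <- c(H = 0.62, O = 0.24, C = 0.12, N = 0.011, Ca = 0.002, Si = 0.02)

total <- sum(proportions)

# Flag the table if the proportions overshoot 1 by more than a small tolerance
tolerance <- 1e-3
sumsTooHigh <- total > 1 + tolerance

# List the entries from largest to smallest to help spot a suspicious one
sort(proportions, decreasing = TRUE)
```

With these placeholder numbers `sumsTooHigh` is `TRUE`, which is the cue to go hunting for an entry that looks an order of magnitude too large.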

The HTML table is parsed into a reasonably workable format, with only one value that needs manual tweaking. The downside to this method of importing tables is that the resulting object has poorly named components; X$NULL$Positive health role in mammals[7] looks more like ASCII art than legitimate R code.
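One way to tame names like these is to overwrite them right after parsing. This is only a sketch: `parsed` is a hand-built stand-in for the imported table, its values are placeholders, and the short names are my own invention rather than anything the XML package provides.

```r
# Stand-in for the parsed table: a list of character vectors with awkward names
parsed <- list(
    "Element" = c("Oxygen", "Carbon"),
    "Atomic percent" = c("24", "12"),  # placeholder values
    "Positive health role in mammals[7]" = c("Yes", "Yes")
)

# Replace the awkward names with short, syntactic ones (hypothetical choices)
names(parsed) <- c("element", "atomicPercent", "healthRole")

# No backticks needed any more
parsed$healthRole
```

The code in this post keeps the original names and quotes them with backticks instead, which works just as well but is noisier to read.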


suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(parallel))
suppressPackageStartupMessages(library(XML))

l <- base::length

# [2] keeps only the second table on the page, as a one-element list named NULL
rawHuman <- readHTMLTable(
    "http://en.wikipedia.org/wiki/Composition_of_the_human_body",
    as.data.frame = FALSE
)[2]

I'm trying to cut down on my use of the imperative programming paradigm; in my opinion the step-wise approach to programming leads to long, complex programs. These can usually be refactored to use lambda expressions, parallel versions of the apply function, and one-liners. The performance boost is just a bonus.
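To make the contrast concrete, here is a toy comparison of the two styles. The `values` vector is arbitrary example data.

```r
values <- c(1, 10, 100, 1000)

# Imperative version: preallocate a result, then fill it step by step
logsLoop <- numeric(length(values))
for (i in seq_along(values)) {
    logsLoop[i] <- log10(values[i])
}

# Functional version: an anonymous function (lambda) passed to sapply
logsApply <- sapply(values, function(v) log10(v))

# Both approaches give the same result
identical(logsLoop, logsApply)
```

Swapping `sapply` for `mclapply` from the parallel package follows the same pattern, except that `mclapply` returns a list, so the result is usually wrapped in `unlist`.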

The elementPlot function takes the $ \log_{10} $ of the composition percentage by element, adds a point to the plot, and draws a label near this point. By near, I mean that a random amount is added to the position of the point (x, y), slightly randomising the label positions. I generated the graphic eight or nine times to get an overlap-free image. A genetic algorithm or calculus-heavy optimisation function would be neater, but would take much longer to write.[4]
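The label jitter on its own looks roughly like this. The percentages are made up, and the `set.seed` call is an addition purely so the sketch is repeatable; the plotting code itself does not fix a seed, which is why each run gives a different layout.

```r
set.seed(42)  # assumption: fixed only to make this sketch repeatable

n <- 5
x <- 1:n
y <- log10(c(50, 5, 0.5, 0.05, 0.005))  # made-up percentages

# Offsets are drawn from fixed grids in steps of 0.1:
# up to 0.5 either side horizontally, up to 0.7 either side vertically
labelX <- x + sample((-5:5) / 10, size = n, replace = TRUE)
labelY <- y + sample((-7:7) / 10, size = n, replace = TRUE)

data.frame(x, y, labelX, labelY)
```

Because the offsets are independent draws, nothing stops two labels from landing on top of each other, hence the repeated regeneration.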

The graphic is also colour-coded: blue means that the element is biologically beneficial in mammals, and red means it is not. The raw Wikipedia data was not organised by a similarly simple TRUE/FALSE scheme. Instead, it provided a Yes/No value, sometimes followed by an explanation. I used the str_match function to try to match the word “Yes” in the data; if it occurred the element was coloured blue, otherwise it was coloured red.
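The matching rule can be tried out in isolation. This sketch uses base R's `grepl` as a stand-in for `str_match`, so it is not the code used for the plot, and the entries in `role` are hypothetical examples of what the column might contain.

```r
# Hypothetical column entries: a bare answer, an answer with an explanation,
# an empty cell and a missing value
role <- c("Yes", "No", "Yes (essential in trace amounts)", "", NA)

# Anything containing the literal word "Yes" counts as a positive role;
# everything else, including blanks and NA, falls into the other group
healthRole <- ifelse(grepl("Yes", role, fixed = TRUE),
                     "Yes",
                     "No/Not Currently Known")

healthRole
```

A vectorised call like this avoids applying a function to each element separately, at the cost of departing from the lambda style used in the plotting code.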


atomicPercent <- rawHuman$`NULL`$`Atomic percent`
# Keep only the rows that report an atomic percentage; ind is reused below
ind <- which(atomicPercent != "")
atomicPercent <- atomicPercent[ind]
# The one value that needs manual tweaking
atomicPercent[l(atomicPercent)] <- 1e-17

elementPlot <- function() {
    role <- rawHuman$`NULL`$`Positive health role in mammals[7]`[ind]

    # "Yes", optionally followed by an explanation, counts as a positive role
    healthRole <- unlist(mclapply(X = role, FUN = function(x) {
        if (is.na(unlist(str_match(x, "Yes")))) {
            "No/Not Currently Known"
        } else {
            "Yes"
        }
    }))

    plotData <- data.frame(
        index = 1:l(atomicPercent),
        element = rawHuman$`NULL`$Element[ind],
        atomicPercent = as.numeric(atomicPercent),
        healthRole = healthRole
    )

    # Labels are nudged by random offsets, so every call gives a new layout.
    # opts(), theme_text() and theme_blank() are no longer part of ggplot2;
    # ggtitle(), theme(), element_text() and element_blank() replace them.
    ggplot(data = plotData) +
        geom_point(aes(x = index, y = log10(atomicPercent), colour = healthRole), size = 4) +
        geom_text(aes(x = index + sample((-5:5)/10, size = l(atomicPercent), replace = TRUE),
                      y = log10(atomicPercent) + sample((-7:7)/10, size = l(atomicPercent), replace = TRUE),
                      label = element), size = 3.3) +
        xlab("") +
        ylab("log10 % of overall composition") +
        ggtitle("The Elemental Composition of the Human Body") +
        theme(plot.title = element_text(size = 15, face = "bold"),
              axis.line = element_blank(),
              axis.ticks = element_blank(),
              axis.text.x = element_blank()) +
        scale_colour_discrete(name = "Does the element have a positive \n health role in Mammals?")
}

[1] The pedant in me must add the qualifier that selenocysteine is the 21st mammalian proteinogenic amino acid.
[2] Brought on by pernicious anaemia, which hinders the absorption of vitamin B12 from the diet.
[3] Enzymes involved in heme synthesis, to be specific.
[4] Sorry about the unreadable ggplot2 code; R markdown ignores my Common Lisp-like indentation style by deleting whitespace. In this case it halved the length of the block of code, doubling unreadability.