Functions

Last week we wrote functions for variance, standard deviation, standard scores and correlation. I’m including those functions below. Take a minute to look over the functions and make sure you understand how they work:

VAR <- function(x){ mean((x-mean(x, na.rm=TRUE))^2)}
SD <- function(x) { sqrt(VAR(x))}
ZSCORE <- function(x) { (x - mean(x))/SD(x)}
CORR <- function(x,y) { mean(ZSCORE(x)*ZSCORE(y))}

Our functions have one important shortcoming: they do not have a way of handling missing (NA) values. For that reason, we will use R’s built-in functions (var, sd, cor) instead.
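A quick check on a toy vector (made-up numbers, not the msleep data) shows the problem and how the built-in functions get around it with `na.rm = TRUE`. The last lines point out a second difference worth knowing about: var and sd divide by n - 1, while our VAR divides by n, so the two agree only after rescaling. This is a sketch for illustration, with VAR repeated so the block runs on its own.

```r
# Repeat the lab's helper so this block is self-contained.
VAR <- function(x) { mean((x - mean(x, na.rm = TRUE))^2) }

x <- c(2, 4, 4, 5, NA)   # toy data (hypothetical) with one missing value

VAR(x)                   # NA: the outer mean() still sees the missing value
var(x)                   # NA as well, unless we ask R to drop missing values
var(x, na.rm = TRUE)     # variance of the four observed values
sd(x, na.rm = TRUE)      # standard deviation of the four observed values

# var() divides by n - 1, VAR() divides by n; rescaling makes them agree.
x_obs <- x[!is.na(x)]
n <- length(x_obs)
all.equal(VAR(x_obs), var(x_obs) * (n - 1) / n)   # TRUE
```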

Mammal Sleep Times

Why do some mammals sleep more than other mammals? We could begin to answer this question by looking at what kinds of mammals sleep more and what kinds of mammals sleep less.

Our data was taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.

First, load the data:

library(ggplot2)
data(msleep)

Next, let’s look at the data and pull up a description of the data:

View(msleep)
?msleep

Several of these columns are self-explanatory; however, you may want to read more about conservation status and REM sleep.

Data Transformation

Let’s take a look at the body and brain sizes of these mammals by plotting body weights on the x-axis and brain weights on the y-axis.

plot(msleep$bodywt, msleep$brainwt, col="red")
text(msleep$bodywt, msleep$brainwt, msleep$name, cex=0.5)

You can see that all of the mammals except for the elephants are squeezed into the lower-left-hand corner of the graph. This isn’t just a problem in terms of seeing the data. When we analyze the data, including looking for correlations, using this scale, all of the non-elephants will be treated as being essentially the same size – small.

We can attempt to solve this problem by transforming the data. For instance, here’s a graph of the square root of brain weight against the square root of body weight:

plot(sqrt(msleep$bodywt), sqrt(msleep$brainwt), col="red")
text(sqrt(msleep$bodywt), sqrt(msleep$brainwt), msleep$name, cex=0.5)

This helps spread out the animals more uniformly but doesn’t quite do the trick. We could get closer using a smaller power. Instead of raising weights to the 1/2 power (using a square root) we can raise them to the 1/100th power:

plot(msleep$bodywt^(1/100), msleep$brainwt^(1/100), col="red")
text(msleep$bodywt^(1/100), msleep$brainwt^(1/100), msleep$name, cex=0.5)

Logarithms

A more common transformation is the “log transformation”. Logarithms work as follows:

If \(10^x = y\) then \(\log_{10} y = x\).

In English, \(\log_{10} y\) is the power that you need to raise 10 to in order to get y.

So,

\[\log_{10} 10 = 1\]
\[\log_{10} 100 = 2\]
\[\log_{10} 1000 = 3\]
\[\log_{10} 1 = 0\]
\[\log_{10} \frac{1}{10} = -1\]
\[\log_{10} \frac{1}{100} = -2\]
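These values are easy to check in R, where log10 is the base-10 logarithm. The short sketch below recomputes the list above and shows that raising 10 to a logarithm takes you back to the original number (250 is just an arbitrary example value).

```r
# The six examples above, computed in one call.
logs <- log10(c(10, 100, 1000, 1, 1/10, 1/100))
logs

# Raising 10 to the logarithm undoes the transformation.
back <- 10^log10(250)
back
```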

Now, let’s look at a graph of the log base 10 of brain weight graphed against the log base 10 of body weight:

plot(log10(msleep$bodywt), log10(msleep$brainwt), col="red")
text(log10(msleep$bodywt), log10(msleep$brainwt), msleep$name, cex=0.5)

This looks much like our 1/100th power transformation graph (apart from a shift and rescaling of the axes, the log transformation is essentially what the \(\frac{1}{n}\) power transformation approaches as \(n \rightarrow \infty\)), but it’s easier to interpret. For instance, the elephants fall between 3 and 4 on the log(body weight) axis, so we know that they weigh between \(10^3 = 1000\) and \(10^4 = 10{,}000\) kilograms.
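The link between very small powers and logarithms comes from a standard limit: \(n(x^{1/n} - 1)\) approaches the natural logarithm of x as n grows. The sketch below checks this numerically on a few made-up weights (not the msleep data) and then confirms that the 1/100th power and log10 versions of the same numbers are almost perfectly correlated, which is why the two graphs look alike.

```r
x <- c(0.01, 1, 100, 10000)   # toy weights in kg (hypothetical)

# Rescaled 1/n powers get closer to the natural log as n grows.
approx_100 <- 100 * (x^(1/100) - 1)
approx_1e6 <- 1e6 * (x^(1/1e6) - 1)
approx_100
approx_1e6
log(x)                        # natural log, for comparison

# Correlation ignores shifts and rescaling, so the two transformations
# of the same data are very nearly linearly related.
r <- cor(x^(1/100), log10(x))
r
```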

New variables

Let’s add two new columns, the log body and log brain weights, to our data set:

msleep$log10bodywt <- log10(msleep$bodywt)
msleep$log10brainwt <- log10(msleep$brainwt)

You can take another look at the data set and see that these have been added:

View(msleep)

Correlations of Subgroups

If we now try to calculate the correlation between log brain weight and log body weight, we’ll get NA as a result. This is because some of the brain weights are missing.

cor(msleep$log10bodywt, msleep$log10brainwt)

If you take a look at the correlation function (you can do so by running ?cor), you’ll see that we have the option of computing correlations using only complete observations:

cor(msleep$log10bodywt, msleep$log10brainwt, use="complete.obs")
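To see what use="complete.obs" is doing, it can help to reproduce it by hand. The sketch below uses a small made-up data frame (the column names body and brain are hypothetical, not msleep columns): rows where either value is missing are dropped, and the correlation is computed on what remains.

```r
# Toy data (hypothetical): one brain weight is missing.
d <- data.frame(body  = c(1, 10, 100, 1000, 50),
                brain = c(0.01, 0.05, NA, 1.2, 0.2))

cor(d$body, d$brain)                              # NA, because of the missing value
r1 <- cor(d$body, d$brain, use = "complete.obs")  # drops incomplete rows for us

# The same thing by hand: keep only rows where both values are present.
keep <- complete.cases(d$body, d$brain)
r2 <- cor(d$body[keep], d$brain[keep])

all.equal(r1, r2)   # TRUE
sum(!keep)          # how many rows were dropped
```

Counting the dropped rows is a useful habit, since a correlation based on far fewer animals than you expected deserves more caution.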

We can also compare this correlation to the correlation between brain weight and body weight without the log transformation:

cor(msleep$bodywt, msleep$brainwt, use="complete.obs")

Q1: How do these correlations compare? Can you explain this?

Now, let’s split mammals into three subgroups: big animals, with body weights of at least 7 kg; small animals, with weights less than 0.5 kg; and medium animals, with weights in between.

big.animals <- msleep[msleep$bodywt >= 7, ]
small.animals <- msleep[msleep$bodywt < 0.5, ]
medium.animals <- msleep[msleep$bodywt >= 0.5 & msleep$bodywt < 7, ]

cor(big.animals$log10bodywt, big.animals$log10brainwt, use="complete.obs")
cor(small.animals$log10bodywt, small.animals$log10brainwt, use="complete.obs")
cor(medium.animals$log10bodywt, medium.animals$log10brainwt, use="complete.obs")
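With hand-written cutoffs, it is worth checking what happens to an animal that sits exactly on a boundary, so that every animal lands in exactly one group. One possible alternative, sketched here on made-up weights, is base R’s cut function, which assigns each value to one interval; with right = FALSE the intervals are [0, 0.5), [0.5, 7) and [7, Inf). The variable names below are hypothetical.

```r
bodywt <- c(0.02, 0.5, 3, 7, 2500)   # toy weights in kg (hypothetical)

# right = FALSE makes each interval include its lower end point,
# so 0.5 counts as medium and 7 counts as big.
size <- cut(bodywt,
            breaks = c(0, 0.5, 7, Inf),
            right  = FALSE,
            labels = c("small", "medium", "big"))

data.frame(bodywt, size)
table(size)          # group sizes; they add up to length(bodywt)
```

On the lab’s data the analogous call would use msleep$bodywt in place of the toy vector.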

Q2: How do the correlations of log brain weights and log body weights within these subgroups compare to the correlation you found using all of the mammals? Can you explain this?

Scatterplots and Correlations

Now let’s see how body weight predicts sleep time first by plotting the data and then by finding the correlation. We’ll use the ggplot2 package to make prettier plots.

Q3: Do larger mammals tend to sleep more or less? What else do these plots tell you?

ggplot(msleep) + geom_point(aes(log10bodywt, sleep_total, color=vore))
ggplot(msleep) + geom_point(aes(log10bodywt, sleep_total))+facet_wrap(~vore)

cor(msleep$log10bodywt, msleep$sleep_total, use="complete.obs")

Now, let’s try the same thing using brain weights.

Q4: Do mammals with larger brains tend to sleep more or less?

ggplot(msleep) + geom_point(aes(log10brainwt, sleep_total, color=vore))
ggplot(msleep) + geom_point(aes(log10brainwt, sleep_total))+facet_wrap(~vore)

cor(msleep$log10brainwt, msleep$sleep_total, use="complete.obs")

Note that we have something of an interesting puzzle here. Body weight and brain weight are both connected to sleep time, but it may be difficult to determine which factor is affecting sleep times, since brain and body weights are so closely connected. Alternatively, sleep times could be determined by some factor we have not explored that is merely correlated with brain and body weights.

Q5: Propose a method we could use to determine whether brain or body weights are the driving factor in sleep times.

Brain Percentage

Monty Python Scientist: “If we increase the size of the penguin until it is the same height as the man and then compare the relative brain sizes, we now find that the penguin’s brain is still smaller. But, and this is the point, it is larger than it was.”

Do mammals that are 100 times heavier have brains that are 100 times heavier? To find out, let’s look at brainwt/bodywt versus log10bodywt.

Q6: Describe your findings.

ggplot(msleep) + geom_point(aes(log10bodywt, brainwt/bodywt, color=vore))

cor(msleep$brainwt/msleep$bodywt, msleep$log10bodywt, use="complete.obs")

We can remove the trend in brain percentage versus log body weight by looking at brainwt/bodywt^0.75 instead of brainwt/bodywt. Further, we can look at the correlation between log body weight and brainwt/bodywt^0.75 to confirm that the relationship is now considerably weaker.
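This lab does not derive the exponent 0.75, but it is easy to see why dividing by a power of body weight can flatten the trend. If brain weight grows roughly like a power of body weight, \(\text{brain} \approx a \cdot \text{body}^b\), then \(\log_{10} \text{brain} \approx \log_{10} a + b \log_{10} \text{body}\), so b is the slope of a straight line on the log-log plot. The sketch below uses made-up data that follows a 0.75 power law exactly (an assumption for illustration, not the msleep data) and recovers the exponent with lm.

```r
# Toy data (hypothetical) that follows brain = 0.01 * body^0.75 exactly.
body  <- c(0.01, 0.1, 1, 10, 100, 1000)
brain <- 0.01 * body^0.75

# On the log-log scale the power law is a straight line:
# the slope is the exponent and the intercept is log10(0.01).
fit <- lm(log10(brain) ~ log10(body))
coef(fit)

# Dividing by body^0.75 leaves a constant, so no trend remains.
ratio <- brain / body^0.75
ratio
```

To try the same idea on the lab’s data, the analogous call would be lm(log10brainwt ~ log10bodywt, data = msleep), using the columns added earlier; lm drops rows with missing values by default.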

Q7: Which species have the highest values of brainwt/bodywt^0.75?

ggplot(msleep, aes(log10bodywt, brainwt/bodywt^0.75, color=vore, label=name))+geom_text()

cor(msleep$brainwt/msleep$bodywt^0.75, msleep$log10bodywt, use="complete.obs")

ggplot(msleep, aes(log10bodywt, brainwt/bodywt^0.75, color=order, label=name))+geom_point()

Q8: What do you see when you color the points based on the order of the mammals?