Descriptive statistics and visualizing your data

1 Assessment

This session is assessed using MCQs (questions highlighted below). The actual MCQs can be found on the BS1040 Blackboard site under Assessments and Feedback/Data analysis MCQs. The deadline is listed there and on the front page of the BS1040 Blackboard site. This assessment contributes 2.5% of module marks. You will receive feedback on this assessment after the submission deadline.

2 R scripts

One of the great advantages of R is reproducibility of your analysis (here’s a nice explanation to read in your own time). This is a major concern of modern biological research. But for the minute, let’s imagine a simpler example. Imagine you have worked several years on an important project. The end result is the culmination of a large number of different analyses. You did this in a GUI-based program, e.g. Minitab (man, I love those pressy buttons). You send the paper off for review. Three months later (and years after you did your first analysis), one reviewer suggests that you should base your results on medians rather than means. Oh lordy, you now have to repeat all your analyses, trying to remember in which order you pressed all those lovely buttons.

Now let’s assume that rather than heeding the alluring but ultimately unfulfilling siren call of Minitab (or SPSS or GraphPad Prism or the countless other products from companies trying to take money off you), you instead stayed on the true path of R. Well, all your analyses would be in an R script. Then all you would have to do is find the line of code where you asked for means and change that to medians.

In RStudio go to File, then New File, then R Script. An empty sheet appears. I tend to write all my commands in there. Then I highlight them and press Run to test them. Save this script often. Use this handy tip; it will save you a load of headaches. If something crashes, the first thing the demonstrators will ask is “where is your script?”.
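As a rough sketch of the “change means to medians” idea, here is what a tiny script could look like. The file name analysis.R and the variable name summary_fn are made up for illustration, and the built-in PlantGrowth dataset is used only because it ships with R; none of this comes from the course materials.

```r
# analysis.R : a hypothetical script name, used only for illustration

# Choose the summary statistic once, near the top of the script.
# If a reviewer asks for medians, change `mean` to `median` here and re-run.
summary_fn <- mean

# Summarise plant weight for each treatment group in the built-in PlantGrowth data
result <- tapply(PlantGrowth$weight, PlantGrowth$group, summary_fn)
print(result)
```

Because the whole analysis is written down, re-running it after the change is a matter of highlighting the lines and pressing Run again.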

3 Built-in datasets

R has a load of built-in datasets. You can have a look at them here or by typing

data()

As I said in the first session, we will try to use them whenever possible. This is just a convenience for teaching. It also allows this course to be useful as a standalone. Think about it. Next year you want to check something but you’ve long since lost the dataset or I’ve taken it off the webpage. When you need to load your own datasets, refer back to the first session.

4 Descriptive statistics

Let’s have a look at the ChickWeight dataset.

View(ChickWeight)

This should open up a viewer to let you have a quick look at the data. It has the weight of a given chick on a given diet on a given day. If you have a large dataset and you just want a quick peek, head() can be useful.
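A quick peek might look something like the sketch below. The variable name first_rows is just for illustration; head(), str() and dim() are standard base R functions.

```r
# Show the first six rows rather than the whole dataset
first_rows <- head(ChickWeight)
print(first_rows)

# Compact overview: one line per variable, with its type and first few values
str(ChickWeight)

# Number of rows and columns
dim(ChickWeight)
```

head() also takes a second argument if you want a different number of rows, e.g. head(ChickWeight, 10).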

Can you get some measures of central tendency and variability of the weight of the chicks? Time to get those Google-fu skills out. Let me get you started. Put “R mean” into Google. Behold all of human knowledge. A little tip: when you are searching for something R related, stay away from https://www.rdocumentation.org/ or https://stat.ethz.ch/R-manual/. These are technical help pages and can be incredibly unhelpful for beginners.

When you want to look at a specific variable in a dataset, use the $ sign. So ChickWeight$Diet refers to the Diet variable.
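For instance, a minimal sketch of pulling out one variable with $ and poking at it (the name diet is only for illustration):

```r
# Pull one column out of the dataset with $
diet <- ChickWeight$Diet

class(diet)    # what kind of variable is it?
levels(diet)   # the distinct diets
table(diet)    # how many measurements there are for each diet
```

Anything you can do to a standalone variable you can do to ChickWeight$Diet directly, e.g. table(ChickWeight$Diet).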

Task: Calculate the mean, median, standard deviation and interquartile range of the weight of the chicks sampled in ChickWeight

Blackboard MCQ: What are the mean, median, standard deviation and interquartile range of the weight of the chicks sampled in ChickWeight?

5 Visualising your data

Really the first thing you should do when you collect a new dataset is look at it. If it’s more than a few datapoints, you’ll do this through figures. This is done before any actual statistical analysis takes place. Think chick weight is related to age? Plot chick weight against age. Does it look good? Okay, maybe you’re right. So how do we plot things in R?

5.1 ggplot

R has pretty robust graphics commands built in. They are fine, but more and more people are using a specific package called ggplot2. It has positives and negatives. For example, let’s make a simple histogram.

hist(ChickWeight$weight)

library(ggplot2)
ggplot(data = ChickWeight, aes(weight)) + geom_histogram()

Ignore the size of the bins for the minute; that is just due to the defaults each command uses. Rather, I want you to focus on the length of the commands. hist (base R) wins easily. (Note: if that didn’t work, have you installed ggplot2 yet?)
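If the different bin sizes bother you, both commands let you override the defaults. The sketch below uses the breaks argument of hist() and the binwidth argument of geom_histogram(); the particular values (20 in each case) are arbitrary choices for illustration, not recommendations.

```r
# install.packages("ggplot2")   # run once if the package is not installed yet
library(ggplot2)

# Base R: 'breaks' suggests roughly how many bins to aim for
hist(ChickWeight$weight, breaks = 20)

# ggplot2: 'binwidth' sets the width of each bin in the units of the x variable
p_hist <- ggplot(data = ChickWeight, aes(weight)) + geom_histogram(binwidth = 20)
p_hist
```

Note that hist() treats breaks as a suggestion and may pick a slightly different number of bins to get “pretty” break points.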

But wait, what about something a bit more complicated? Imagine we wanted a scatterplot of the weight of chickens over time, separated by which diet they were on. By the way, I don’t expect you to really follow the two code examples below. Rather, I want you to see that ggplot is shorter for more complicated graphs.

Base colored scatter plot example:

plot(weight ~ Time, data = subset(ChickWeight, Diet == "1"))
points(weight ~ Time, col = "red", data = subset(ChickWeight, Diet == "3"))
legend(x = "topleft", legend = c("1", "3"), title = "Diet", col = c("black", "red"),
    pch = c(1, 1))

ggplot2 colored scatter plot example:

ggplot(subset(ChickWeight, Diet %in% c("1", "3")), aes(x = Time, y = weight, color = Diet)) +
    geom_point()

ggplot wins!
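The gap widens as the plot grows. As a sketch, showing all four diets in ggplot only needs the subset() dropped (the name p_all is just for illustration), whereas the base version would need another points() call and legend entry for every extra diet.

```r
library(ggplot2)

# No subset(): every diet in ChickWeight gets its own colour and legend entry
p_all <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
    geom_point()
p_all
```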

So although ggplot has a steeper learning curve, it’s perhaps worth learning from the beginning as it makes life easier later. This webpage goes into much more detail about teaching ggplot to beginners, when you have some time. This is a nice little book for learning how to do a multitude of graphs in base or ggplot, if you’re interested.

5.1.1 Making and understanding your first ggplot graph

Let’s go back to the ToothGrowth dataset. Have a quick read of what it’s about. You want to know whether vitamin C supplementation increases tooth growth in guinea pigs (as a bioassay for efficiency). So the first thing you do is plot the data in a scatterplot.

ggplot(ToothGrowth, aes(x = dose, y = len)) + geom_point()

The first thing you did there was use a function called ggplot, which basically says draw a graph. The first argument is usually the dataset, in this case ToothGrowth. Then you lay down the aesthetics using aes. That means you are telling the computer how the data relate to the graph, e.g. that dose is the x-value. Try running ggplot(ToothGrowth,aes(x=dose,y=len)). What happens? Why? You get empty axes, because you haven’t yet told R how you want the data represented, i.e. its geometry. In this case, we want a scatterplot so we use geom_point.

Blackboard MCQ: Based on the scatterplot you just made (shown above), what effect does increasing vitamin C dosage have on tooth growth?

This is an ugly graph. Luckily, ggplot graphs are completely customisable. The main point to remember is that a ggplot graph is built up of layers, with the data and the aesthetics being the first layer. Let’s add some more layers to pretty it up. I hate the grey background. ggplot has some pre-defined themes to help me with that. We should also add some axis labels.

ggplot(ToothGrowth, aes(x = dose, y = len)) + geom_point() + xlab("Dosage of vitamin C") +
    ylab("Tooth length") + theme_bw()

If you don’t understand what each part of the command is doing here, spend some time on it and then ask. This is important.
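One way to work out what each part does is to build the same graph one piece at a time and look at it after every step. This is only a sketch: storing the plot in a variable (here called p, a name chosen for illustration) is optional, but it lets you add to the plot gradually. Type p after any step to see the graph so far.

```r
library(ggplot2)

# Step 1: data + aesthetics only; printing p now would show empty axes
p <- ggplot(ToothGrowth, aes(x = dose, y = len))

# Step 2: add a geometry so the data are drawn as points
p <- p + geom_point()

# Step 3: label the axes
p <- p + xlab("Dosage of vitamin C") + ylab("Tooth length")

# Step 4: swap the default grey theme for a black-and-white one
p <- p + theme_bw()

p  # print the finished plot
```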

5.1.2 Barcharts bad, boxplots good

From looking at the dataset (what do you mean you haven’t?), you will have seen there is another variable, supp. The researchers clearly thought it was important how the vitamin C was delivered. Let’s see if they were right.

ggplot(ToothGrowth, aes(x = dose, y = len, colour = supp)) + geom_point() + xlab("Dosage of vitamin C") +
    ylab("Tooth length") + theme_bw()

Before we get on to what the data mean, can you see what the difference in the command was, compared to the last graph? Think about why the change is in aes and not anywhere else.

The type of supplementation (supp) seems to be having an effect. Let’s use a different type of graph to look at this more closely. Scatterplots are very good for looking at trends in the raw data. However, you can often see the pattern more clearly if you represent some measure of central tendency and variation.

Bar charts (see below) were pretty common in biology.

However, more and more people are realising their limitations. The main point is they can hide a lot of information. A more standard (and simpler to code) way to present this type of data is as a boxplot. Below is a boxplot comparing orange juice to vitamin C tablet effects on tooth growth.

I’ll give you the code below. But before you just mindlessly cut and paste, can you create this figure by altering the code of the scatterplot? (Hint: use geom_boxplot() in place of geom_point(). Your x-axis has also changed.)

Did you try?

Sure? Spoilers ahead.

Okay

ggplot(ToothGrowth, aes(x = supp, y = len, colour = supp)) + geom_boxplot() + xlab("supplementation type") +
    ylab("Tooth length") + theme_bw()

So you can start to see the power of ggplot. Compare the code for the scatterplot and the boxplot. The syntax of every type of command is the same, so you only have to learn it once.
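To push the point a bit further, here is a sketch where everything except the geometry is stored once and then reused. geom_violin() is another ggplot2 geometry, picked here purely as an illustration, and the variable names are made up.

```r
library(ggplot2)

# Data, aesthetics, labels and theme are shared between the two plots
base_plot <- ggplot(ToothGrowth, aes(x = supp, y = len, colour = supp)) +
    xlab("supplementation type") + ylab("Tooth length") + theme_bw()

# Only the geometry differs
box_version <- base_plot + geom_boxplot()
violin_version <- base_plot + geom_violin()

box_version
violin_version
```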

One last improvement we could make is to have a boxplot with the raw data also displayed.

Extra challenge: I’m not going to give you the code for this, but can you work out how to do it? (Hint: a graph can have more than one geometry.)

Blackboard MCQ: From the above graph, does orange juice or vitC tablet increase tooth growth more on average?

Blackboard MCQ: What’s the geom for a histogram in ggplot?