Exercise 1

Data in R is arranged in data.frames. This is a rectangular array where the columns represent the different variables or measures, and the rows represent observations. We’ll be looking at one such data frame that describes the effect of diet on early growth of chicks.

We can take a look at the first 6 rows of the data set using the head command

data(ChickWeight)
head(ChickWeight,10)
##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
tail(ChickWeight,10)
##     weight Time Chick Diet
## 569     67    4    50    4
## 570     84    6    50    4
## 571    105    8    50    4
## 572    122   10    50    4
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4

What do you think the tail command will do? Try it out by adding a new line to the code chunk above. See if you can figure out how to get the first 10 rows instead of just the first 6. Hint: Take a look at the help for head - Help is available in the bottom right pane, or just use Google!

Another useful command is summary. As the name suggest, it summarises whatever object is passed to it. Try running summary on the ChickWeight dataset.

summary(ChickWeight)
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Why do you think the Chick and Diet columns are summarised in a different way to the weight and Time columns? You might want to replace this text with your answer so you can refer back to it later.

We’ll be using the ggplot2 package to do most of the plotting we need to do this semester. The following commands are the base of what we’ll need.

library(ggplot2)
ggplot(data = ChickWeight) +
  geom_point(mapping = aes(x = Time, y = weight))

This might seem like quite a bit to write! Let’s break it down.

We can add new aesthetics to this plot. e.g. we could colour the points by Diet by adding colour=Diet inside the aes() mapping function:

ggplot(data=ChickWeight) +
  geom_point(mapping=aes(x=Time, y=weight, shape=Diet))

Alternatively, you could change the shape of points using shape=Diet instead. Try changing the code above to do that.

You can also change the type of plot. e.g. you could try a line plot instead by switching to geom_line:

ggplot(ChickWeight) +
  geom_line(aes(x=Time, y=weight, colour=Diet))

Notice that I’ve dropped the data= and mapping=. This is because they’re the first parameters to the ggplot and geom_line functions, so we don’t have to name them if we don’t want to - it saves a bit of typing, but is maybe a little less readable unless you already know that?

This doesn’t look quite right though - instead of a line for each chick, we seem to have a line for each diet instead, with some weird saw-tooth shape. This is because it is linking up all the observations at each time point (i.e. all chicks within a diet) so we get the vertical lines. There’s a couple ways to fix this. One is to colour by Chick so that we have lines in distinct colours for each chick. Try this:

ggplot(ChickWeight) +
  geom_line(aes(x=Time, y=weight, colour=Chick))

The problem with the last plot is it’s hard to tell the different diets - are all the pinkish line chicks all one diet, or are they on different diets? Instead, we could keep colouring by Diet and add a grouping for Chick so we get separate lines:

ggplot(ChickWeight) +
  geom_line(aes(x=Time, y=weight, colour=Diet, group=Chick))

That’s better - now we can see that the third group seems to have the most growth, and the first group the least on average. The second group has an outlier with very low growth, while the rest of the chicks on Diet 2 seem to be doing better.

At this point we might want to look at making the plot a bit prettier. e.g. what if we want different colours? What if we want a title, or different axis labels?

ggplot(ChickWeight) +
  geom_line(aes(x=Time, y=weight, colour=Diet, group=Chick)) +
  scale_colour_manual(values=c("pink", "blue", "darkgreen", "orange")) +
  ylab("Weight (g)") +
  ggtitle("Weights of chicks on different feed")

Feel free to experiment!

Exercise 2: Exploratory Analysis of Trigger Point Data

This exercise is about doing some some exploratory data analysis on the number of Myofascial trigger points (palpable nodules of skeletal muscle that cause pain) in racing greyhounds from the Manawatu.

The data are from week 1 of a study designed to look at whether treating of trigger points via transcutaneous electrical nerve stimulation (TENS) was effective in reducing trigger points and provided some advantage to the greyhounds in terms of increased race performance. The data are from pre-treatment. The dogs were divided randomly into control and treatment groups, and the veterinarian was blinded to this assignment. The veterinarian then assessed each dog to evaluate the number of trigger points, and an associated pain score. The following variables were recorded

Variable Description
name The name of the dog
sex The sex of the dog (f for female, m for male)
num The number of trigger points
pain The overall pain score
treatment Whether the dog would be treated (1 for treated, 0 for control)

Start by reading the data into R and assigning to the variable tps with the following command

tps = read.csv("http://www.massey.ac.nz/~jcmarsha/227212/data/tps_week1.csv")
  1. Click on the tps that has appeared in the top right Environment tab and take a look through the data.

  2. Add a new code chunk below this paragraph and use the summary command to summarise the tps data set. You can add a new code chunk by clicking on Insert->R from the toolbar, or Ctrl-Alt-I on the keyboard.

summary(tps)
##                 name    sex         num             pain         treatment  
##  .com 99K 6114    : 1   f:16   Min.   : 0.00   Min.   : 0.00   Min.   :0.0  
##  Brady Z61 285    : 1   m:22   1st Qu.: 6.50   1st Qu.:16.00   1st Qu.:0.0  
##  Duke Z61 722     : 1          Median :12.00   Median :25.50   Median :0.5  
##  Elle 19D 5224    : 1          Mean   :12.55   Mean   :26.82   Mean   :0.5  
##  Ellwood 109B 6211: 1          3rd Qu.:15.75   3rd Qu.:34.75   3rd Qu.:1.0  
##  Emmett Z61 282   : 1          Max.   :32.00   Max.   :64.00   Max.   :1.0  
##  (Other)          :32
  1. You should have noticed that the treatment column has been summarised as if it is numeric, whereas it is supposed to be a grouping variable. We can take a look at a column by using $ to pull the column out of the data frame:
tps$treatment
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  1. Try adding a new line under that one to pull out the sex of the dog. You should notice that as you type tps$, RStudio popped up a prompt to help provide a shortcut. If you didn’t notice it, type it in again, this time pausing after the $.
tps$sex
##  [1] m m m f m m f m m m f f f m m m m m m m m m m f f f f m f m f m f f f f m f
## Levels: f m
  1. Now that we know how to reference columns, we can do things with them. e.g. we could table up the treatment column using table:
table(tps$treatment)
## 
##  0  1 
## 19 19
  1. Or we could convert the treatment column to a factor or categorical variable with the factor command:
tps$group = factor(tps$treatment)
  1. Recall that the above is an assignment. What we’ve done is taken the expression on the right (which takes the treatment column from tps and converts it to a factor) and we’ve assigned it to the group column from tps. Note that group didn’t exist before - we’ve just created it now. Click on the tps data frame in the Environment to check you have the new column.

  2. We now want to look at plotting the data. In the same way as before, we give the information to ggplot. At this stage, we only want to look at the number of trigger points, so we’ll use a histogram for that

ggplot(tps) +
  geom_histogram(aes(x=num), bins=6)

  1. Try playing around with the number of bins to see how the shape of the histogram changes. How would you describe this shape? Is it symmetric or is it skew to the right or left?

  2. We can plot by group by using the fill aesthetic. This plots the histograms stacked on top of each other, using the same bins - i.e. it’s the same plot as before, but each box that makes up the histogram is coloured according to group.

ggplot(tps) +
  geom_histogram(aes(x=num, fill=group), bins=8)
  1. What do you think of the difference (if any) between control and treatment groups? Remember that this was before any treatment had been performed. Is this a problem for the experiment? You might want to add a note here.

  2. Try altering the colours and fixing up the axis labels, and maybe adding a title. The colours can be changed using scale_fill_manual, which changes the scale for the fill aesthetic.

ggplot(tps) +
  geom_histogram(aes(x=num, fill=group), bins=8)+
  scale_fill_manual(values=c("pink", "blue"))+
  ggtitle("Trigger points on dogs")

  1. Finally, use the Knit button to produce a HTML file combining the analysis, plots etc. You can also try knitting to Word as well if you like.

  1. You may need to install this package first - see the Packages tab in the bottom right (next to Help) to do so, then just click on Install, and type in ggplot2. Once installed, the library command will work.