Data in R is arranged in data.frame
s. This is a rectangular array where the columns represent the different variables or measures, and the rows represent observations. We’ll be looking at one such data frame that describes the effect of diet on early growth of chicks.
We can take a look at the first 6 rows of the data set using the head
command
data(ChickWeight)
head(ChickWeight,10)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
tail(ChickWeight,10)
## weight Time Chick Diet
## 569 67 4 50 4
## 570 84 6 50 4
## 571 105 8 50 4
## 572 122 10 50 4
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
What do you think the tail
command will do? Try it out by adding a new line to the code chunk above. See if you can figure out how to get the first 10 rows instead of just the first 6. Hint: Take a look at the help for head
- Help is available in the bottom right pane, or just use Google!
Another useful command is summary
. As the name suggest, it summarises whatever object is passed to it. Try running summary
on the ChickWeight
dataset.
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
Why do you think the Chick
and Diet
columns are summarised in a different way to the weight
and Time
columns? You might want to replace this text with your answer so you can refer back to it later.
We’ll be using the ggplot2
package to do most of the plotting we need to do this semester. The following commands are the base of what we’ll need.
library(ggplot2)
ggplot(data = ChickWeight) +
geom_point(mapping = aes(x = Time, y = weight))
This might seem like quite a bit to write! Let’s break it down.
Firstly, the library
command tells R that it should load the ggplot2
package up so we can use it1.
The next command sets up the plot. We first tell ggplot
which data frame we are wanting to plot.
The geom_point()
bit then adds a layer of points to the plot - one for each observation in the data frame. They’ll be placed according to the aesthetics, which define how our data should be mapped to features on the plot (e.g. x, y axes, colour, shape etc). We use the aes()
command to define that we want Time
on the x axis and weight
on the y axis.
We can add new aesthetics to this plot. e.g. we could colour the points by Diet by adding colour=Diet
inside the aes()
mapping function:
ggplot(data=ChickWeight) +
geom_point(mapping=aes(x=Time, y=weight, shape=Diet))
Alternatively, you could change the shape of points using shape=Diet
instead. Try changing the code above to do that.
You can also change the type of plot. e.g. you could try a line plot instead by switching to geom_line
:
ggplot(ChickWeight) +
geom_line(aes(x=Time, y=weight, colour=Diet))
Notice that I’ve dropped the data=
and mapping=
. This is because they’re the first parameters to the ggplot
and geom_line
functions, so we don’t have to name them if we don’t want to - it saves a bit of typing, but is maybe a little less readable unless you already know that?
This doesn’t look quite right though - instead of a line for each chick, we seem to have a line for each diet instead, with some weird saw-tooth shape. This is because it is linking up all the observations at each time point (i.e. all chicks within a diet) so we get the vertical lines. There’s a couple ways to fix this. One is to colour by Chick
so that we have lines in distinct colours for each chick. Try this:
ggplot(ChickWeight) +
geom_line(aes(x=Time, y=weight, colour=Chick))
The problem with the last plot is it’s hard to tell the different diets - are all the pinkish line chicks all one diet, or are they on different diets? Instead, we could keep colouring by Diet
and add a grouping for Chick
so we get separate lines:
ggplot(ChickWeight) +
geom_line(aes(x=Time, y=weight, colour=Diet, group=Chick))
That’s better - now we can see that the third group seems to have the most growth, and the first group the least on average. The second group has an outlier with very low growth, while the rest of the chicks on Diet 2 seem to be doing better.
At this point we might want to look at making the plot a bit prettier. e.g. what if we want different colours? What if we want a title, or different axis labels?
ggplot(ChickWeight) +
geom_line(aes(x=Time, y=weight, colour=Diet, group=Chick)) +
scale_colour_manual(values=c("pink", "blue", "darkgreen", "orange")) +
ylab("Weight (g)") +
ggtitle("Weights of chicks on different feed")
Feel free to experiment!
This exercise is about doing some some exploratory data analysis on the number of Myofascial trigger points (palpable nodules of skeletal muscle that cause pain) in racing greyhounds from the Manawatu.
The data are from week 1 of a study designed to look at whether treating of trigger points via transcutaneous electrical nerve stimulation (TENS) was effective in reducing trigger points and provided some advantage to the greyhounds in terms of increased race performance. The data are from pre-treatment. The dogs were divided randomly into control and treatment groups, and the veterinarian was blinded to this assignment. The veterinarian then assessed each dog to evaluate the number of trigger points, and an associated pain score. The following variables were recorded
Variable | Description |
---|---|
name | The name of the dog |
sex | The sex of the dog (f for female, m for male) |
num | The number of trigger points |
pain | The overall pain score |
treatment | Whether the dog would be treated (1 for treated, 0 for control) |
Start by reading the data into R and assigning to the variable tps
with the following command
tps = read.csv("http://www.massey.ac.nz/~jcmarsha/227212/data/tps_week1.csv")
Click on the tps
that has appeared in the top right Environment
tab and take a look through the data.
Add a new code chunk below this paragraph and use the summary
command to summarise the tps
data set. You can add a new code chunk by clicking on Insert->R from the toolbar, or Ctrl-Alt-I on the keyboard.
summary(tps)
## name sex num pain treatment
## .com 99K 6114 : 1 f:16 Min. : 0.00 Min. : 0.00 Min. :0.0
## Brady Z61 285 : 1 m:22 1st Qu.: 6.50 1st Qu.:16.00 1st Qu.:0.0
## Duke Z61 722 : 1 Median :12.00 Median :25.50 Median :0.5
## Elle 19D 5224 : 1 Mean :12.55 Mean :26.82 Mean :0.5
## Ellwood 109B 6211: 1 3rd Qu.:15.75 3rd Qu.:34.75 3rd Qu.:1.0
## Emmett Z61 282 : 1 Max. :32.00 Max. :64.00 Max. :1.0
## (Other) :32
treatment
column has been summarised as if it is numeric, whereas it is supposed to be a grouping variable. We can take a look at a column by using $ to pull the column out of the data frame:tps$treatment
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sex
of the dog. You should notice that as you type tps$
, RStudio popped up a prompt to help provide a shortcut. If you didn’t notice it, type it in again, this time pausing after the $
.tps$sex
## [1] m m m f m m f m m m f f f m m m m m m m m m m f f f f m f m f m f f f f m f
## Levels: f m
table
:table(tps$treatment)
##
## 0 1
## 19 19
factor
command:tps$group = factor(tps$treatment)
Recall that the above is an assignment. What we’ve done is taken the expression on the right (which takes the treatment
column from tps
and converts it to a factor) and we’ve assigned it to the group
column from tps
. Note that group
didn’t exist before - we’ve just created it now. Click on the tps
data frame in the Environment
to check you have the new column.
We now want to look at plotting the data. In the same way as before, we give the information to ggplot
. At this stage, we only want to look at the number of trigger points, so we’ll use a histogram for that
ggplot(tps) +
geom_histogram(aes(x=num), bins=6)
Try playing around with the number of bins to see how the shape of the histogram changes. How would you describe this shape? Is it symmetric or is it skew to the right or left?
We can plot by group by using the fill
aesthetic. This plots the histograms stacked on top of each other, using the same bins - i.e. it’s the same plot as before, but each box that makes up the histogram is coloured according to group.
ggplot(tps) +
geom_histogram(aes(x=num, fill=group), bins=8)
What do you think of the difference (if any) between control and treatment groups? Remember that this was before any treatment had been performed. Is this a problem for the experiment? You might want to add a note here.
Try altering the colours and fixing up the axis labels, and maybe adding a title. The colours can be changed using scale_fill_manual
, which changes the scale for the fill
aesthetic.
ggplot(tps) +
geom_histogram(aes(x=num, fill=group), bins=8)+
scale_fill_manual(values=c("pink", "blue"))+
ggtitle("Trigger points on dogs")
Knit
button to produce a HTML file combining the analysis, plots etc. You can also try knitting to Word as well if you like.You may need to install this package first - see the Packages tab in the bottom right (next to Help) to do so, then just click on Install, and type in ggplot2
. Once installed, the library
command will work.↩