We ran an introductory workshop on plotting in R and ggplot2 that you can find here. In the second part of that workshop, there were a lot of challenges that we didn’t get round to doing. Here, we’re going to use the same dataset to see if you can remember the basic processes involved in getting a scatter plot up and running in ggplot2.
So, here, you’re going to make some scatter plots for the `hills` dataset that is found in the `MASS` package of R.
This is an R-markdown document, and working with one is slightly different from working with the R scripts we used in the workshop. Both R scripts (suffix `*.R`) and R-markdown documents (suffix `*.Rmd`) are plain-text files (so you can view them in Notepad or similar), and both need to be processed by R before your results / graphs are visible. The easiest way to work with R-markdown documents, however, is inside RStudio.
To process an R-markdown file in RStudio:
Make sure the file is open in the source panel (use File -> Open File if not);
Either click Run -> Run All at the top of the source panel (this intersperses compiled figures with your raw text/code);
Or, preferably, knit the whole document to an .html file using the Knit button (again, this should be at the top of the source panel in RStudio; shortcut: Ctrl-Shift-K or Cmd-Shift-K).
In the document, we have challenges interspersed with plain text and formatting information. You are going to write your own code to solve the challenges. However, you should be able to knit this document at any point (even without having attempted the challenges) to produce a viewable .html file. So try running Knit right now.
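If you prefer the R console to the Knit button, the same HTML output can usually be produced with `rmarkdown::render()`. The sketch below assumes the rmarkdown package is installed and uses a hypothetical file name, `hills_challenges.Rmd`, which is not taken from the workshop materials; replace it with the path to your own copy of this document.

```r
# Hypothetical file name (an assumption): point this at your own .Rmd file
rmd_file <- "hills_challenges.Rmd"

# Only attempt to knit if the rmarkdown package and the file are both available
can_knit <- requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)

if (can_knit) {
  # Roughly what the Knit button does: run every chunk and write an .html file
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document")
  message("Wrote: ", html_file)
} else {
  message("Nothing knitted: install rmarkdown and check the path to the file")
}
```

The check on `can_knit` simply means the snippet does nothing harmful if the file name has not been changed yet.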
# Load the packages that are used later in the document
library(MASS) # Contains the `hills` dataset
library(dplyr) # Data manipulation tools
library(ggplot2) # Plotting tools
We import the dataset from the `MASS` package and rearrange it slightly.
# Loads the `hills` dataset from the `MASS` package
data(hills)
tidy_hills <- mutate(hills, peak = row.names(hills))
str(tidy_hills)
## 'data.frame': 35 obs. of 4 variables:
## $ dist : num 2.5 6 6 7.5 8 8 16 6 5 6 ...
## $ climb: int 650 2500 900 800 3070 2866 7500 800 800 650 ...
## $ time : num 16.1 48.4 33.6 45.6 62.3 ...
## $ peak : chr "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
head(tidy_hills)
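The `mutate()` call above copies the race names, which `hills` stores as row names, into an ordinary column called `peak`, so that they can be used like any other variable. As a rough base-R sketch of the same step (the name `tidy_hills_base` is made up for this illustration and is not used elsewhere):

```r
library(MASS)  # Contains the `hills` dataset

data(hills)

# Copy the data frame, then store the row names in a new `peak` column
tidy_hills_base <- hills
tidy_hills_base$peak <- rownames(hills)

# The new column sits alongside dist, climb and time
names(tidy_hills_base)
```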
From here, you should use the `tidy_hills` dataset.
You made a number of scatter plots using the Anscombe dataset in the initial workshop. Try to plot the record time versus race distance for the Scottish hills dataset.
If you need a refresher, check the examples in the help pages for `ggplot` and `geom_point`.
The steps (for ggplot2) are:
Define your dataset
Map columns of that dataset to aesthetic entities in your chart
Define the type of chart you want to generate
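As a reminder of how those three steps look in code, here is a minimal sketch using the built-in `anscombe` dataset from the workshop rather than `tidy_hills`, so that the challenge is still yours to solve. The choice of the `x1` and `y1` columns is only for illustration.

```r
library(ggplot2)

# Step 1: define the dataset (anscombe ships with base R)
# Step 2: map columns to aesthetics (x1 -> x position, y1 -> y position)
p <- ggplot(data = anscombe, mapping = aes(x = x1, y = y1)) +
  # Step 3: choose the chart type; geom_point() draws a scatter plot
  geom_point()

print(p)
```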
# Your code goes here:
# Plot time on the y-axis, and distance on the x-axis
# .. End of your code
# Challenge 1 Solution:
ggplot(data = tidy_hills, aes(x = dist, y = time)) +
  geom_point() +
  xlim(0, NA) +
  ylim(0, NA)
Having filled in the code, you can now either run that code or knit the document to generate an .html file.
This time, try to make a scatter plot of time-against-height, and add axis labels to the x- (height) and y- (time) axes.
Check the help page for `labs` if you need to work out how to add titles or axis labels.
If you want to include the units for the height climbed and the record time, have a look at the help page for `hills` (`?hills`).
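As a small sketch of how `labs()` attaches a title and axis labels to a plot, here is an example that again uses `anscombe`, leaving the challenge itself for you. The label text is invented for this example.

```r
library(ggplot2)

p <- ggplot(data = anscombe, mapping = aes(x = x2, y = y2)) +
  geom_point() +
  # labs() takes name = "text" pairs; x and y set the axis labels
  labs(
    title = "Anscombe's second dataset",
    x = "x2 (no units)",
    y = "y2 (no units)"
  )

print(p)
```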
# Your code goes here:
# - Plot time on the y-axis and height on the x-axis and add labels for the x-
# and y- axes.
# .. End of your code
# Challenge 2 Solution:
ggplot(data = tidy_hills, aes(x = climb, y = time)) +
  geom_point() +
  labs(
    x = "Height climbed / feet",
    y = "Record time / min"
  ) +
  xlim(0, NA) +
  ylim(0, NA)
This time, try to plot height-against-distance for the hill runs but encode the time-taken using the size of the scatter-plot points.
Hint: If you look at the help page for `geom_point`, you can see that it “… understands the following aesthetics: `x`, `y`, …, `shape`, `size`, `stroke`”. This means that you can map a column from your dataset to the `size` attribute of the corresponding points, just as you would map a column to the `x`-position attribute.
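To illustrate the hint, the sketch below maps a third column to `size` using the built-in `mtcars` dataset. That dataset is not part of the workshop; it is used here only to avoid giving away the answer.

```r
library(ggplot2)

# wt -> x position, mpg -> y position, hp -> point size
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, size = hp)) +
  geom_point()

print(p)
```

Because `size` is mapped inside `aes()`, ggplot2 adds a size legend automatically; its title can be changed with `labs(size = ...)`.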
# Your code goes here:
# - Plot distance on the x-axis and height on the y-axis
# - Use the time-taken to determine the size of the corresponding points
# .. End of your code
# Challenge 3 Solution:
ggplot(data = tidy_hills, aes(x = dist, y = climb, size = time)) +
  geom_point() +
  labs(
    x = "Race distance / miles",
    y = "Height climbed / feet",
    size = "Record time / min"
  ) +
  lims(
    x = c(0, NA),
    y = c(0, NA)
  )
Now knit the whole document together.
Hopefully those three graphs came out as you wanted. Feel free to modify them to make them a bit prettier - there are lots of suggestions for how to do this on the ggplot2 website. For example, you could change the axis ranges, the point colours, the point styles, or the theme of the charts.
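As one possible starting point, here is a sketch that restyles the first scatter plot. The particular colour, shape and theme are arbitrary choices made for this example, not recommendations from the workshop, and the code uses `hills` directly so that it runs on its own.

```r
library(MASS)     # Contains the `hills` dataset
library(ggplot2)  # Plotting tools

data(hills)

p <- ggplot(data = hills, mapping = aes(x = dist, y = time)) +
  # Fixed (non-mapped) styling goes outside aes()
  geom_point(colour = "steelblue", shape = 17, size = 3) +
  # Axis ranges: NA lets ggplot2 pick the upper limit from the data
  xlim(0, NA) +
  ylim(0, NA) +
  # Swap the default grey theme for a plainer one
  theme_minimal()

print(p)
```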