Lecture 2 - Joining data and plots

Author

Jan C Greyling

Review

Continuing from where you left off in lecture 1. Go to the Teams site, download the dataset called rainfall.csv, save it in your project’s data folder, and do the following.

  1. Create a new script, call it Lecture 2 and save it in your Scripts folder.

  2. Import and redo

  • Import the rainfall data and call it d.rain

  • Repeat what you did yesterday.

Code
remove(list=ls()) #clear all from memory
require(data.table)
require(ggplot2)

dat <- fread("Data/wheat_student_data.csv")
d.rain <- fread("Data/rainfall.csv")

dat[, N_tot := N_plant + N_tdress + N_spray]

dat <- dat[!is.na(yld),]
dat[, yld_max := max(yld), by = .(year,local,rep)]
dat[, yld_norm := yld/yld_max*100 ]
  1. Now look at the data by simply typing the name of the dataset (d.rain) into the console and hit enter. Now do the same with dat. What is the difference?

  2. Coming back to the rainfall dataset. You will see that the dataset contains the year and local for the growing season months (Apr to Sep). Calculate the total rainfall during the growing season and call it rain_tot (note that I keep the variable names as short as possible)

Code
d.rain[, rain_tot :=  Apr+May+Jun+Jul+Aug+Sep]
head(d.rain)
   year       local  Apr May  Jun   Jul  Aug  Sep rain_tot
1: 2016     Caledon 53.0 6.0 60.5  79.0 42.0 33.5    274.0
2: 2016     Darling 27.5 6.5 65.0  57.5 42.5 26.0    225.0
3: 2016  Langgewens 47.2 4.4 97.8  73.8 51.6 43.8    318.6
4: 2016 Porterville 12.0 5.0 64.0  52.0 47.0 34.0    214.0
5: 2016  Riversdale  0.0 0.0 29.6  68.0 34.4 63.3    195.3
6: 2016   Tygerhoek 15.8 1.7 42.0 105.7 29.8 73.5    268.5

Plotting data

Now we get to the fun part! Lets plot! Let’s start by plotting the rainfall data.

Rainfall plots

Using the dat dataset, we can now visualise the rainfall data to understand it better. Lets start by plotting the total growing season rainfall for all locations as a density plot. For help with getting started, look at the ggplot cheat sheet. Call your plot p1.

Code
p1 <- 
  ggplot() +
  geom_density(data= d.rain,aes(x = rain_tot)) #its better to specify the data in geom_density and not ggplot() if you want plot the data of multiple datasets on the same plot.
p1

This plot looks OK, but we can do much better - the background should be white, the horizontal axis-title should be expanded, the vertical axis does not start at 0, and the horizontal axis should begin and end at the min and max rainfall.

Let’s start by adding a theme to remove the grey background. I use theme_bw() for all of my plots since it both prints well and works great with a projector. Note the base_size = 12 with the function, the function would work without this, but then it would revert to all of the defaults. However, changing the size of the text is typically necessary.

Code
p1.1 <- #name of the next version of the plot
p1 +
  theme_bw(base_size = 11)
p1.1 #plots version p1.1 of the plot

OK, now we should do something about the axis titles and maybe we can add a title to the plot while we are at it. This can be done with the labs function.

Code
p1.2 <- #name of the next version of the plot
p1.1 + 
  labs(x = "Total growing season rainfall (mm)",y = "Density", title = "Growing season rainfall (2016-2019)")
p1.2 #plots version p1.1 of the plot

OK, now we can fix the axes. This step is tricky, especially if we also want to specify the break as the axis labels. Here we specify 10 breaks on the x-axis with breaks = scales::pretty_breaks(n = 10), expand = c(0, 0) specifies that the plot should start and stop with the data i.e. there is no “padding” around the plots, and limits = c(min(d.rain$rain),max(d.rain$rain)) specifies that the plot should start and stop at the min and max rainfall values. I do the same thing on the y-axis but with fewer breaks Note that I have to manually specify the y-axis’s upper limit by looking at it; in this case, it is limits = c(0,0.006).

Code
p1.3 <- #name of the next version of the plot
p1.2 +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 10),expand = c(0, 0),limits = c(min(d.rain$rain),max(d.rain$rain))) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 5),expand = c(0, 0),limits = c(0,0.008))
p1.3

Now, what if we wanted so see the rainfall distribution for each location individually? For that purpose we use the facit_wrap function. In this instance, we only want to facet by local; thus, the first argument is blank as indicated by the . with the local following after the ~. Note that I add the facet to p1.2 and not p1.3 since it includes the limits specified according to the combined plot; thus, the facets will not plot correctly. Try adding the facets to p1.3 to see what happens.

Code
p1.4 <- 
p1.2 +
  facet_wrap(.~local)
p1.4

Now lets take it even further by adding a vertical dotted line which shows the average rainfall for each location during these four years. For this, you will also have to draw on your learnings from last week.

Code
r.mean <- d.rain[, list(mean = mean(rain_tot)), by = .(local)]

p1.4 +
 geom_vline(data = r.mean, aes(xintercept=mean),linetype =2)

Please note that you do not have to build your plots in a stepwise fashion such as in this example; you can do all of it in one go.

Code
p1 <- 
  ggplot() +
  geom_density(data= d.rain,aes(x = rain_tot)) +
   geom_vline(data = r.mean, aes(xintercept=mean),linetype =2) +
     scale_x_continuous(breaks = scales::pretty_breaks(n = 5),
                        expand = c(0, 0),limits = c(min(d.rain$rain_tot),max(d.rain$rain_tot))) +
     scale_y_continuous(breaks = scales::pretty_breaks(n = 5),
                        expand = c(0, 0),limits = c(0,0.015)) +
  labs(x = "Total growing season rainfall (mm)",y = "Density", title = "Growing season rainfall by location (2016-2019)") +
  facet_wrap(.~local) +
  theme_bw(base_size = 11)
  p1

The last step is to save your data. It’s important that you always specify the dimensions and resolution of your image.

Code
ggsave("Figures/Figure1_rainfall.png",p1,units = "cm",width = 16, height = 16/3*2,dpi=1200)

Joining data

  1. Now join the rainfall dataset with the trail data and call the new dataset dat.
Code
setkey(dat,year,local)
setkey(d.rain,year,local)
dat <- merge(dat,d.rain, all.x = T)
head(dat)
   year   local rep plot     yld N_plant N_tdress N_spray N_tot yld_max
1: 2016 Darling   1    1  530.30       0        0       0     0 3059.27
2: 2016 Darling   1    2 1157.89      25       25       0    50 3059.27
3: 2016 Darling   1    3 1656.37      25      105       0   130 3059.27
4: 2016 Darling   1    4  946.80      25       50       0    75 3059.27
5: 2016 Darling   1    5 3059.27      25      135       0   160 3059.27
6: 2016 Darling   1    6 1630.23      25      165       0   190 3059.27
    yld_norm  Apr May Jun  Jul  Aug Sep rain_tot
1:  17.33420 27.5 6.5  65 57.5 42.5  26      225
2:  37.84857 27.5 6.5  65 57.5 42.5  26      225
3:  54.14265 27.5 6.5  65 57.5 42.5  26      225
4:  30.94856 27.5 6.5  65 57.5 42.5  26      225
5: 100.00000 27.5 6.5  65 57.5 42.5  26      225
6:  53.28820 27.5 6.5  65 57.5 42.5  26      225
  1. What does the setting all.x = T mean when using the merge function. Hint: Type ?merge into the console.

Assignment 2

OK, now its your turn.

  1. Plot the density of actual wheat yields by location. Stretch goal: remove the labels of the vertical axis.

  1. Now plot the relationship between yield and total nitrogen application by location using a scatter plot (point) with the year indicated by the point’s colour.