Joining data

Continuing from where you left off in lecture 1. Go to SUNLearn and download the dataset called rainfall.csv and save it in your data folder for this project and then do the following.

  1. Import the dataset and call it d.rain

  2. Now look at the data by simply typing the name of the dataset (d.rain) into the console and hit enter. Now do the same with dat. What is the difference?

  3. Coming back to the rainfall dataset. You will see that the dataset contains the year, local and the total rainfall for the growing season months (Apr to Sep). Calculate the total rainfall during the growing season and call it rain (note that I keep the variable names as short as possible)

##    year       local  Apr May  Jun   Jul  Aug  Sep  rain
## 1: 2016     Caledon 53.0 6.0 60.5  79.0 42.0 33.5 274.0
## 2: 2016     Darling 27.5 6.5 65.0  57.5 42.5 26.0 225.0
## 3: 2016  Langgewens 47.2 4.4 97.8  73.8 51.6 43.8 318.6
## 4: 2016 Porterville 12.0 5.0 64.0  52.0 47.0 34.0 214.0
## 5: 2016  Riversdale  0.0 0.0 29.6  68.0 34.4 63.3 195.3
## 6: 2016   Tygerhoek 15.8 1.7 42.0 105.7 29.8 73.5 268.5
  1. Now join the rainfall dataset with the trail data and call the new dataset dat.
##    year   local rep plot     yld N_plant N_tdress N_spray N_tot yld_max
## 1: 2016 Darling   1    1  530.30       0        0       0     0 3059.27
## 2: 2016 Darling   1    2 1157.89      25       25       0    50 3059.27
## 3: 2016 Darling   1    3 1656.37      25      105       0   130 3059.27
## 4: 2016 Darling   1    4  946.80      25       50       0    75 3059.27
## 5: 2016 Darling   1    5 3059.27      25      135       0   160 3059.27
## 6: 2016 Darling   1    6 1630.23      25      165       0   190 3059.27
##     yld_norm  Apr May Jun  Jul  Aug Sep rain
## 1:  17.33420 27.5 6.5  65 57.5 42.5  26  225
## 2:  37.84857 27.5 6.5  65 57.5 42.5  26  225
## 3:  54.14265 27.5 6.5  65 57.5 42.5  26  225
## 4:  30.94856 27.5 6.5  65 57.5 42.5  26  225
## 5: 100.00000 27.5 6.5  65 57.5 42.5  26  225
## 6:  53.28820 27.5 6.5  65 57.5 42.5  26  225

Plotting data

Now we get to the fun part! Lets plot! Lets start by plotting the rainfall data.

Rainfall plots

Using the the d.rain dataset we can now visualise the rainfall data to understand it better.Lets start by plotting the total growing season rainfall for all to the locations as a density plot. For help with getting started, have a look at the ggplot cheat sheet. Call your plot p1.

This looks OK but we can do much better - the background should be white, horisontal axis title should be expanded, the vertical axis does not start at 0 and the horisonal axis should begin and end at the min and max rainfall.

Lets start by adding a theme so that we can remove the grey background. I use theme_bw() for all of my plots since it both prints well and works great with a projector. Note the base_size = 12 with the function, the function would work without this but then it would revert to all of the defaults. However, changing the size of the text is tipically necessary.

p1.1 <- #name of the next version of the plot
p1 +
  theme_bw(base_size = 11)
p1.1 #plots version p1.1 of the plot

OK, now we should do something about the axis titles and maybe we can add a title to the plot while we are at it. This can be done with the labs function.

p1.2 <- #name of the next version of the plot
p1.1 + 
  labs(x = "Total growing season rainfall (mm)",y = "Density", title = "Growing season rainfall (2016-2019)")
p1.2 #plots version p1.1 of the plot

Ok, now can fix the axises. This is a bit more tricky, especially if we also want to specify the break as the axis labels. Here we specify 10 breaks on the x axis with breaks = scales::pretty_breaks(n = 10), expand = c(0, 0) specifies that the plot should start and stop with the data i.e. there is no “padding” around the plots, and limits = c(min(d.rain$rain),max(d.rain$rain)) specifies that the plot should start and stop at the min and max rainfall values. Note that I have to specify the upper limit of the y axis manually by looking at it, in this case it is limits = c(0,0.006).

p1.3 <- #name of the next version of the plot
p1.2 +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 10),expand = c(0, 0),limits = c(min(d.rain$rain),max(d.rain$rain))) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 5),expand = c(0, 0),limits = c(0,0.006))
p1.3

Now, what if we wanted so see the rainfall distribution for each location individually? For that purpose we use the facit_wrap function. In this instance we only want to facet by local and thus the first argument is bank as indicated by the . with the local following after the ~. Note that I add the facet to p1.2 and not p1.3 since it includes the limits specified according to the combined plot and thus the facets will not plot correctly. Try adding the facets to p1.3 to see what happens.

p1.4 <- 
p1.2 +
  facet_wrap(.~local)
p1.4

Lets specify the limits for the facet based on a visual inspection of p1.4, not the limits must be true for all of the plots. Not that since the plots are smaller the number breaks has to be less as well otherwise not all of the text would fit.

p1.5 <- 
p1.4 +
   scale_x_continuous(breaks = scales::pretty_breaks(n = 5),expand = c(0, 0),limits = c(min(d.rain$rain),max(d.rain$rain))) +
   scale_y_continuous(breaks = scales::pretty_breaks(n = 5),expand = c(0, 0),limits = c(0,0.0105))
p1.5

Now lets take it even further by adding a vertical dotted line which shows the average rainfall for each location during this four year period. For this you will have to draw on your learnings from last week as well.

r.mean <- d.rain[, list(mean = mean(rain)), by = .(local)]

p1.5 +
 geom_vline(data = r.mean, aes(xintercept=mean),linetype =2)

Please note that you do not have to build your plots in a stepwise fashion such as this example, you can do all of it in one go.

p1 <- 
  ggplot() +
  geom_density(data= d.rain,aes(x = rain)) +
   geom_vline(data = r.mean, aes(xintercept=mean),linetype =2) +
     scale_x_continuous(breaks = scales::pretty_breaks(n = 5),
                        expand = c(0, 0),limits = c(min(d.rain$rain),max(d.rain$rain))) +
     scale_y_continuous(breaks = scales::pretty_breaks(n = 5),
                        expand = c(0, 0),limits = c(0,0.0105)) +
  labs(x = "Total growing season rainfall (mm)",y = "Density", title = "Growing season rainfall by location (2016-2019)") +
  facet_wrap(.~local) +
  theme_bw(base_size = 11)
  p1

The last step is to save your data. Its imporant that you always specify the dimensions and resolution of your image.

ggsave("Figures/Figure1_rainfall.png",p1,units = "cm",width = 16, height = 16/3*2,dpi=1200)

Yield plots

Ok, now its your turn.

  1. Plot the density of actual wheat yields by location. Stretch goal: remove the labels of the vertical axis.

  2. Now plot the relationship between yield and total nitrogen application by location using a scatter plot (point) with the year indicated by the colour of the point.