REN R 213: Data Science for Resource Analysis

This document is an early working draft. It will be updated frequently

Author
Affiliation

Glen W. Armstrong

Univeristy of Alberta

Published

Last update: 2024-09-05

Motivation

The main reason I am offering this course is to introduce students to a way of thinking about data that will help them succeed in their university career, particularly in quantitative courses. I will also introduce some software tools that will allow you to implement that way of thinking. This course will allow you to

  • write better reports,

  • create better graphics,

  • save time on analysis and writing, and

  • understand how to use a scripting language to manage data, analyze the data, and communicate insights gained from those data.

Graduates knowledgeable in the techniques of data science are sought by employers in the renewable natural resources sector. You learn snippets of data science in some of the other courses you have taken or will take, but those courses tend to use well-formatted datasets unlike anything you will see in the real world.


My work involves collecting data from multiple sources in multiple formats, and gaining insights from those data to support better forest management. I have used many different applications over the years to manipulate and analyze those data, and to communicate insights gained from the analysis.

Spreadsheet apps such as Microsoft Excel have been important items in my toolbox. I suspect you, as students, use them quite often as well. In the last few years, I have found reliance on spreadsheet apps to be very limiting, and I find myself increasingly using scripting languages such as R, Python, and Julia. These languages have several advantages over spreadsheet apps.

1) They can process much larger datasets. The datasets used for forestry and other natural resources can consist of many millions of records (e.g. LiDAR points or forest permanent sample plot data).

2) It is much easier to create good and easy to replicate graphics with a scripting language and a good graphics package. It is possible to create good graphics using a spreadsheet app, but creating consistent graphics can be a challenge. Making the graphics “pretty enough” for publication can be extremely time consuming.

3) The scripting languages allow for “literate programming” in which a computer program integrates and a natural language (e.g. English) explaining the embedded computer source code and the results coming from it. Most scripting languages support the use of “notebooks”, which help make literate programming straightforward and natural. This document was written in Quarto using its literate programming features.

4) The use of notebooks to makes it very easy to modify code and simultaneously update the text describing the results.

5) …. some more stuff.

Introduction

Data science is an interdisciplinary field that uses statistical methods, algorithms, and computational tools to extract insights and knowledge from data. It combines techniques from mathematics, statistics, computer science, and domain expertise to analyze complex datasets, uncover patterns, make predictions, and inform decision-making.

R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund is a comprehensive introduction to using R for data science, covering the entire data analysis pipeline. The book focuses on the “tidyverse,” a collection of R packages designed for data manipulation, visualization, and analysis.

Ideally, management of renewable resources is based on a data-driven understanding of the resource and the values of stakeholders. The renewable resource that I am most familiar with is the boreal forest of Canada. I will use the development of yield curves from permanent sample plot data from Alberta forests to illustrate some of the techniques used in data science.

A generalized workflow for the application of data science to a problem is shown below (from R for Data Science (2e) )

The data science flow.

Yield of trembling aspen / white spruce mixedwood stands in Alberta

Many upland sites in the central mixedwood and dry mixedwood natural subregrions in Alberta are forested with stands of trembling aspen, white spruce, or both. The successional pathway in this type of forest is often described as follows.

After a disturbance such as a wildfire, the pioneer species is often trembling aspen. It is well-adapted to take advantage of increased light and nutrients following a disturbance. As the aspen stand grows, it creates for shade, allowing the shade tolerant white spruce to establish. The white spruce seeds could have been present in the soil seed bank or debris left after the disturbance. The white spruce continues to grow and eventually dominates. Trembling aspen in generally thought of as a relatively short-lived species. As the stand ages, the aspen component becomes smaller as trees die.

The purpose of this document is to demonstrate the use of R and the tidyverse to “wrangle” data and communicate information contained in the data. The example I use is the development of timber yield curves from permanent sample plot data. he graphic below summarizes how trembling aspen / white spruce mixedwood stands could develop over time. The connected line segments plotted in grey represent volume estimates from remeasured permanent sample plots. The black curves represent yield functions estimated using non-linear regression. The panels represent the hardwood (mostly aspen) and softwood (mostly white spruce) components of the stand. Total is the sum of hardwood and softwood volumes.

The creation of this graphic required many steps, detailed below.

Growth and yield models

Growth and yield models provide foresters with a way of projecting the development of a forest stand over time. Of the simplest yield models is a volume over age curve or table. These present stand volume (m3 ha-1 ) as a function of stand age (y).

The data used in the development of yield curves are collected from sample plots. At the minimum, diameter at breast height (dbh) and species are recorded for each tree in the plot. For fixed area plots, plot area and plot age are recorded. Plot age could be based on data collected from trees in the plot, disturbance history, or an interpreter call of the age of the stand the plot is located in. The tree measurements are compiled to provide an estimate of each measured tree, and aggregated to provide plot-level summaries. Sample plots can be temporary (measured just once) or permanent (measured and then remeasured at least once).

Permanent sample plot data

The Alberta Ministry of Forestry and Parks maintains a number of permanent sample plots (PSPs) to provide the basis for forest growth and yield models. These PSPs are fixed area plots with known locations that are remeasured in order to collect data on how individual trees develop over time. The data are stored in several files representing information on individual tree measurements on each tree at each measurement, plot-level data, plot measurement information, silvicultural treatment information, disturbance information, and more.

The version of the data that I’m working with contains 1,597,826 records in the tree measurement data file. Microsoft Excel is probably the analytical tool that most foresters are most comfortable with. Excel can handle 1,048,576 rows, so Excel is not the tool that should be used for handling data at the tree level.

One of the data files provided to me by Alberta Forestry and Parks contains PSP data compiled to the plot level. It is also a fairly large file with 193,501 records and using 66.6 MB of disk space.

These data contained compiled information on all of the PSPs managed by Alberta Forestry and Parks. I will use these data to try to understand how stand volume is related to age in trembling aspen / white spruce stands in the central and dry mixedwood natural subregions of Alberta.

We will use the tidyverse collection of R packages

Load libraries

Almost no data set comes in the best format for analysis. We often need to filter, reformat etc. Some of the packages in the tidyverse are very helpful. In order to use the tidyverse, we need to load the library. R responds by showing us which packages have been loaded, and tells us about any conflicts.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing data

We need data in order to do analysis. The data relevant to a problem are often stored in one or more data files, or they could be streamed real time from a measurement device. It is important to keep a copy of the data in its original format. I never modify the original data set.

A very common format for data files is CSV (comma separated values). CSV is a good way to store data as they can be viewed with almost any editor. They are not stored in a proprietary format which may become unreadable at some point in time. I use function read_csv from the readr package of tidyverse to read the data from the CSV file to a “tibble” named compiled_data.

The glimpse() function can provide us with an overview of the tibble. The CSV file contains 119 columns. Alberta Forestry and Parks provided a description of the data stored in the columns.

I use the select function from dplyr to keep only the variables I might use in this analysis. I also use it to rename the strata and species columns to names I prefer.

# compiled_data <- read_csv("../data/compiled.csv", guess_max = 400000, 
#                            col_types = cols(col_factor(NULL))) |> #because of opening_number
#     select(company_plot_number, natural_subregion, 
#           stratum = strata, 
#           substrata, mmt_num, mmt_yr, age, scale, species_group = species, vol_0000,          vol_1307,
#           vol_1510, sphBH, ba, latitude, longitude) 

 compiled_data <- read_csv("D:/Shared/ds4ra/data/GOA_PGYICompiledstratified.csv",guess_max = 400000, 
                            col_types = cols(col_factor(NULL))) |> #because of opening_number
     select(plot, natsub, 
           stratum = strata, 
           substrata, mmt_num, mmt_yr, age, scale, species_group = species, vol_0000,
           vol_1307,
           vol_1510, sphBH, ba, latitude, longitude) 

glimpse(compiled_data)
Rows: 190,798
Columns: 16
$ plot          <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1…
$ natsub        <chr> "CM", "CM", "CM", "CM", "CM", "CM", "CM", "CM", "CM", "C…
$ stratum       <fct> HwSw, HwSw, HwSw, HwSw, HwSw, HwSw, HwSw, HwSw, HwSw, Hw…
$ substrata     <chr> "Pb", "Pb", "Pb", "Pb", "Pb", "Pb", "Pb", "Pb", "Pb", "P…
$ mmt_num       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mmt_yr        <dbl> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 19…
$ age           <dbl> 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, …
$ scale         <chr> "con_dec", "con_dec", "con_dec", "species", "species", "…
$ species_group <chr> "CO", "DE", "TO", "AW", "BW", "FB", "FD", "LT", "PB", "P…
$ vol_0000      <dbl> 166.9494054, 190.7733849, 357.7227903, 87.3050721, 3.550…
$ vol_1307      <dbl> 152.9442034, 181.9258691, 334.8700725, 84.0315940, 2.739…
$ vol_1510      <dbl> 141.9254737, 178.5291191, 320.4545928, 83.0169898, 2.015…
$ sphBH         <dbl> 691.699605, 306.324111, 998.023715, 108.695652, 39.52569…
$ ba            <dbl> 20.8144152, 22.0104235, 42.8248386, 8.8008214, 0.6250352…
$ latitude      <dbl> 54.40402, 54.40402, 54.40402, 54.40402, 54.40402, 54.404…
$ longitude     <dbl> -116.8341, -116.8341, -116.8341, -116.8341, -116.8341, -…

Data wrangling

We do not need all of these data for the task at hand. We will only keep a few of the columns, and get rid of the unneeded rows. I am only interested in the rows where scale == "con_dec" as these rows contain the volume compiled to species group. The other rows contain compiled data for individual species, which I will not consider at the time. For this analysis I will only consider “pure” trembling aspen, mixed trembling aspen / white spruce, and pure white spruce stands.

compact_data <- compiled_data |>
  filter(scale == "con_dec") |>
  filter(natsub %in% c("CM", "DM")) |>
  filter(stratum %in% c("Hw","HwSw","SwHw","Sw")) |>
  mutate(stratum = factor(stratum, levels = c("Hw","HwSw","SwHw","Sw"))) |>
  filter(! substrata %in% c("Pb","Bw")) |>
  mutate(
    stratrum = case_match(
      stratum,
      "Hw" ~ "Aw",
      "HwSw" ~ "AwSw",
      "SwHw" ~ "SwAw",
      .default = stratum
    ),
  )

Some of the PSP were established as a cluster of subplots within a stand. I chose to aggregate the subplots and treat them a single plot.

compact_data <- compact_data|>
  select(- substrata, - scale) |>
  filter(age >= 0) |>
  mutate(plot_number = floor(plot)) |>
  mutate(subplot_number = (plot %% 1) * 10 ) |>
  summarize(mmt_yr = mean(mmt_yr), age = mean(age), vol_0000 = mean(vol_0000),
            vol_1307 = mean(vol_1307),
            vol_1510 = mean(vol_1510), ba = mean(ba), sphBH = mean(sphBH), 
            latitude = mean(latitude),
            longitude = mean(longitude),
            natsub = first(natsub),
            stratum = first(stratum),
            .by = c(plot_number, mmt_num, species_group))

We are now working with a tibble with 2,286 rows and 14 columns.

glimpse(compact_data)
Rows: 2,628
Columns: 14
$ plot_number   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mmt_num       <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7,…
$ species_group <chr> "CO", "DE", "TO", "CO", "DE", "TO", "CO", "DE", "TO", "C…
$ mmt_yr        <dbl> 1960, 1960, 1960, 1965, 1965, 1965, 1968, 1968, 1968, 19…
$ age           <dbl> 79.66667, 79.66667, 79.66667, 84.66667, 84.66667, 84.666…
$ vol_0000      <dbl> 241.8281, 121.1210, 362.9490, 257.1231, 121.3032, 378.42…
$ vol_1307      <dbl> 218.26536, 115.33430, 333.59966, 234.26794, 116.27788, 3…
$ vol_1510      <dbl> 199.50734, 111.80036, 311.30770, 216.53853, 113.91324, 3…
$ ba            <dbl> 30.48874, 13.15864, 43.64738, 31.19399, 12.82497, 44.018…
$ sphBH         <dbl> 1186.6967, 231.6248, 1418.3215, 1107.1691, 178.7651, 128…
$ latitude      <dbl> 54.40421, 54.40421, 54.40421, 54.40421, 54.40421, 54.404…
$ longitude     <dbl> -116.8354, -116.8354, -116.8354, -116.8354, -116.8354, -…
$ natsub        <chr> "CM", "CM", "CM", "CM", "CM", "CM", "CM", "CM", "CM", "C…
$ stratum       <fct> SwHw, SwHw, SwHw, SwHw, SwHw, SwHw, SwHw, SwHw, SwHw, Sw…

Analysis

The summary function provides descriptive statistics for each of the numerical columns and counts for factor columns. These can be very helpful

compact_data <- compact_data |>
  mutate(natsub = as_factor(natsub)) |>
  mutate(species_group = as_factor(species_group))

summary(compact_data)
  plot_number      mmt_num       species_group     mmt_yr          age        
 Min.   :   1   Min.   : 1.000   CO:876        Min.   :1960   Min.   :  2.00  
 1st Qu.: 350   1st Qu.: 2.000   DE:876        1st Qu.:1987   1st Qu.: 13.00  
 Median :3012   Median : 3.000   TO:876        Median :1993   Median : 38.50  
 Mean   :3891   Mean   : 3.517                 Mean   :1994   Mean   : 57.51  
 3rd Qu.:9007   3rd Qu.: 5.000                 3rd Qu.:2001   3rd Qu.: 95.44  
 Max.   :9033   Max.   :10.000                 Max.   :2024   Max.   :211.50  
    vol_0000          vol_1307         vol_1510             ba         
 Min.   :  0.000   Min.   :  0.00   Min.   :  0.000   Min.   : 0.0000  
 1st Qu.:  0.996   1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.5758  
 Median : 42.929   Median : 11.84   Median :  6.196   Median : 8.2794  
 Mean   :115.955   Mean   : 96.93   Mean   : 89.368   Mean   :14.4203  
 3rd Qu.:204.211   3rd Qu.:172.67   3rd Qu.:156.410   3rd Qu.:27.3943  
 Max.   :841.505   Max.   :805.75   Max.   :786.144   Max.   :73.0081  
     sphBH          latitude       longitude      natsub    stratum    
 Min.   :    0   Min.   :53.03   Min.   :-119.4   CM:2469   Hw  :1059  
 1st Qu.:  400   1st Qu.:55.18   1st Qu.:-117.5   DM: 159   HwSw: 708  
 Median : 1255   Median :55.66   Median :-115.3             SwHw: 468  
 Mean   : 2327   Mean   :56.24   Mean   :-115.2             Sw  : 393  
 3rd Qu.: 2920   3rd Qu.:57.82   3rd Qu.:-113.1                        
 Max.   :34320   Max.   :59.19   Max.   :-110.1                        

Exploratory data analysis

Let’s do some graphics. This first graph is 15/10 total volume plotted against age

 ggplot(compact_data |> filter(species_group == "TO"), aes(x = age, y = vol_1510)) +
   geom_point() 

 # ggplot(compact_data |> filter(species_group == "TO"), aes(x = age, y = vol_1510)) +
 #   geom_hex() 

We can see some sort of pattern here, despite how much variability there is in the data. In general, volume increases with age, but probably in a nonlinear way.

We can add a smoothing curve to the datato perhaps gain some insight.

 ggplot(compact_data |> filter(species_group == "TO"), aes(x = age, y = vol_1510)) +
   geom_point() +
   geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The second graph splits out the CO and DE components. We are interested in both components.

ggplot(compact_data |> filter(species_group != "TO"), aes(x = age, y = vol_1510)) +
   geom_point() +
   facet_wrap(~ species_group) +
   geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The DE component appears to show declining volumes starting at 80 or so years of age.

Let’s recode the value of the species_group column to something more readable.

spgrp_levels <- c("Softwood", "Hardwood", "Total")
compact_data$species_group <- recode(compact_data$species_group, CO = "Softwood", 
                                     TO = "Total", DE = "Hardwood")

A facet grid is a good way to slice and dice the data. The geom_smooth function provides a hint about overall trends in the data.

 ggplot(compact_data |> filter(species_group != "Total"), aes(x = age, y = vol_1510) )+
   geom_point() +
   facet_grid(stratum~ species_group) +
   geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

So far we’ve been ignoring the fact that the data represent remeasured permanent sample plots. We can show how the plots develop over time. It seems pretty clear that the hardwood component of these stands start falling apart near 80 years of age.

ggplot(compact_data |> filter(species_group != "Total"), aes(x = age, y = vol_1510, 
                                                          group =  plot_number)) +
   geom_line() +
   facet_wrap(~ species_group) 

There are other ways of summarizing the data. The boxplot below shows the variation in volume estimates for 10 year wide age classes. This serves to emphasize the variability in the data.

The plot is also better annotated than the previous ones. Annotation is straightforward using ggplot.

ggplot(compact_data |> filter(species_group != "Total"), aes(x = age, y = vol_1510)) +
  geom_boxplot(aes(group = cut_width(age, 10))) +
  facet_wrap(~species_group) + 
  labs(
    y = bquote('Stand volume '~(m^3~ha^-1)),
    x = 'Age (y)',
    title = "Volume distributions by age from PSPs for mixed spruce/aspen stands",
    subtitle = "Hardwood volume component starts declining at about 80-100 years of age",
    caption = "Data from Alberta Forestry and Parks"
  )

hardwood_data <- compact_data |>
  filter(species_group == "Hardwood")

softwood_data <- compact_data |>
  filter(species_group == "Softwood")
ggplot(hardwood_data, aes(x = age, y = vol_1510)) +
   geom_line(aes(group = plot_number)) 

Statistical analysis

We will use nonlinear least squares to find the best fit for the function

\[v = a t^b \exp(-t /k)\ ,\] where \(v\) is stand volume (m\(^3\) ha\(^{-1}\)), \(t\) is stand age (y), and \(a\), \(b\), and \(k\) are model coefficients. This function was chosen because it allows for both increasing and declining volumes with age.

library(nlstools)

'nlstools' has been loaded.
IMPORTANT NOTICE: Most nonlinear regression models and data set examples
related to predictive microbiolgy have been moved to the package 'nlsMicrobio'
model1 <- nls(vol_1510 ~ a * age^b * exp(-age / k), 
        data=hardwood_data,
        start = list(a = 2.96e-4, b = 3.64, k = 30))
        
model1
Nonlinear regression model
  model: vol_1510 ~ a * age^b * exp(-age/k)
   data: hardwood_data
        a         b         k 
8.094e-06 4.931e+00 1.702e+01 
 residual sum-of-squares: 4114291

Number of iterations to convergence: 24 
Achieved convergence tolerance: 1.102e-06
summary(model1)

Formula: vol_1510 ~ a * age^b * exp(-age/k)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a 8.094e-06  1.061e-05   0.763    0.446    
b 4.931e+00  3.847e-01  12.818   <2e-16 ***
k 1.702e+01  1.327e+00  12.822   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 68.65 on 873 degrees of freedom

Number of iterations to convergence: 24 
Achieved convergence tolerance: 1.102e-06

We’ll not go into much detail about the statistics, but the \(b\) and \(k\) coefficients are statistically significant. We will plot the estimated curve along with the input data.

pred_data_hardwood <- tibble(age = seq(0, 220, length.out = 221))

pred_vol <- predict(model1, pred_data_hardwood)

pred_data_hardwood <- mutate(pred_data_hardwood, vol_1510 = pred_vol)
pred_data_hardwood <- mutate(pred_data_hardwood, species_group = "Hardwood")
ggplot(pred_data_hardwood, aes(x = age, y = vol_1510)) +
  geom_line(colour = "black", 
            size = 2) +
  geom_line(data = hardwood_data, aes(group = plot_number), colour = "grey") + 
  labs(
    y = bquote('Stand volume '~(m^3~ha^-1)),
    x = 'Age (y)',
    title = "Predicted yield curve for hardwood component",
    subtitle = "Hardwood volume component starts declining at about 80-100 years of age",
    caption = "Data from Alberta Forestry and Parks"
  )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

The same stuff repeated for the softwood component.

ggplot(softwood_data, aes(x = age, y = vol_1510)) +
   geom_point() 

 model1 <- nls(vol_1510 ~ a * age^b * exp(-age / k), 
        data=softwood_data,
        start = list(a = 2.96e-4, b = 3.64, k = 30))
        
model1
Nonlinear regression model
  model: vol_1510 ~ a * age^b * exp(-age/k)
   data: softwood_data
        a         b         k 
1.257e-06 4.649e+00 3.638e+01 
 residual sum-of-squares: 3143143

Number of iterations to convergence: 36 
Achieved convergence tolerance: 1.788e-06
summary(model1)

Formula: vol_1510 ~ a * age^b * exp(-age/k)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a 1.257e-06  2.014e-06   0.624    0.533    
b 4.649e+00  4.190e-01  11.095  < 2e-16 ***
k 3.638e+01  4.428e+00   8.216 7.56e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60 on 873 degrees of freedom

Number of iterations to convergence: 36 
Achieved convergence tolerance: 1.788e-06
pred_data_softwood <- tibble(age = seq(0, 220, length.out = 221))

pred_vol <- predict(model1, pred_data_softwood)

pred_data_softwood <- mutate(pred_data_softwood, vol_1510 = pred_vol)
pred_data_softwood <- mutate(pred_data_softwood, species_group = "Softwood")


ggplot(pred_data_softwood, aes(x = age, y = vol_1510)) +
  geom_line(colour = "black", 
            size = 2) +
  geom_line(data = softwood_data, aes(group = plot_number), colour = "grey") + 
  labs(
    y = bquote('Stand volume '~(m^3~ha^-1)),
    x = 'Age (y)',
    title = "Predicted yield curve for softwood component",
    caption = "Data from Alberta Forestry and Parks"
  )

Some more wrangling and graphic presentation

This show a little bit of data wrangling using pivot_wider and pivot_longer to put the data together in a way that makes creation of the final plot easier. We also calculate total volume as the sum of the hardwood and softwood components.

pred_data <- rbind(pred_data_hardwood,pred_data_softwood)

pred_data_2 <- pivot_wider(pred_data, names_from = species_group, values_from = vol_1510)
pred_data_3 <- pred_data_2 |>
  mutate(Total = Hardwood + Softwood)
pred_data_4 <- pivot_longer(pred_data_3, names_to = "species_group", cols = c("Hardwood", "Softwood", "Total"),
                                                 values_to = "vol_1510")
pred_data_4 <- arrange(pred_data_4, species_group, age)
ggplot(pred_data, aes(x = age, y = vol_1510)) +
  geom_line() +
  facet_wrap(~ species_group)

ggplot(compact_data, aes(x = age, y = vol_1510)) +
   geom_line(aes(group = plot_number), color = "grey") +
   geom_line(data = pred_data_4) +
  facet_wrap(~ species_group)  + 
  labs(
    y = bquote('Stand volume '~(m^3~ha^-1)),
    x = 'Age (y)',
    title = "Yield curves for mixedwood trembling aspen / white spruce stands",
    subtitle = "Estimated from permanent sample plots using nonlinear regression",
    caption = "Data from Alberta Forestry and Parks"
  )

ggsave("pretty_yield_curve.png")
Saving 7 x 5 in image