In this example, we will use linear regression model to predict aboveground biomass using dbh and total tree height. the data is derive from paper which contain, foliage (fol), Branch (bch), stem bark (brk), stem wood (stm), and total aboveground biomass(tab) along with diameter at breast height (dbh), live crown length (lcl), and total tree height (tht).
setwd("E://Training//AbovegroundBiomass")
df <- read.csv("NayBormann.csv")
we will focus on total aboveground biomass in this example, first thing we would like to do qith data at hand is to look at a few observation, summary statistics, and create some exploratory figures. R function “head()” will print first six rows of the data whereas, function ‘summary()’ will provide basic statistics of all variables in the dataset.
head(df)
## tree dbh tht lcl fol bch brk stm tab
## 1 4 45.8 31.6 11.1 37 85 164 953 1239
## 2 5 23.1 22.1 11.2 5 7 32 188 232
## 3 12 54.9 33.3 16.5 52 111 241 1401 1805
## 4 28 36.7 32.1 12.2 23 56 115 665 858
## 5 33 58.5 36.0 17.1 71 148 294 1706 2220
## 6 34 43.0 30.9 14.1 28 31 140 813 1011
str(df)
## 'data.frame': 32 obs. of 9 variables:
## $ tree: int 4 5 12 28 33 34 57 64 65 76 ...
## $ dbh : num 45.8 23.1 54.9 36.7 58.5 43 59.6 45.7 36.1 42.5 ...
## $ tht : num 31.6 22.1 33.3 32.1 36 30.9 34.1 27.8 27.9 31.1 ...
## $ lcl : num 11.1 11.2 16.5 12.2 17.1 14.1 18.8 12.6 11.7 14.4 ...
## $ fol : int 37 5 52 23 71 28 85 28 30 40 ...
## $ bch : int 85 7 111 56 148 31 152 48 54 48 ...
## $ brk : int 164 32 241 115 294 140 282 129 99 137 ...
## $ stm : int 953 188 1401 665 1706 813 1636 749 574 797 ...
## $ tab : int 1239 232 1805 858 2220 1011 2154 954 757 1021 ...
summary(df)
## tree dbh tht lcl
## Min. : 4.00 Min. :23.10 Min. :20.10 Min. : 8.50
## 1st Qu.: 64.75 1st Qu.:34.42 1st Qu.:25.30 1st Qu.:12.07
## Median :120.50 Median :45.75 Median :30.75 Median :15.55
## Mean :108.88 Mean :45.85 Mean :29.09 Mean :15.80
## 3rd Qu.:153.50 3rd Qu.:55.80 3rd Qu.:32.12 3rd Qu.:18.73
## Max. :186.00 Max. :80.30 Max. :36.00 Max. :27.60
## fol bch brk stm
## Min. : 5.00 Min. : 7.00 Min. : 32.00 Min. : 188.0
## 1st Qu.: 16.75 1st Qu.: 31.75 1st Qu.: 69.75 1st Qu.: 402.5
## Median : 38.00 Median : 80.50 Median :138.50 Median : 805.0
## Mean : 44.38 Mean :100.59 Mean :162.44 Mean : 943.0
## 3rd Qu.: 70.25 3rd Qu.:151.25 3rd Qu.:214.00 3rd Qu.:1242.8
## Max. :107.00 Max. :347.00 Max. :434.00 Max. :2517.0
## tab
## Min. : 232.0
## 1st Qu.: 554.8
## Median :1034.5
## Mean :1250.3
## 3rd Qu.:1700.8
## Max. :3404.0
library(ggplot2)
Because we are intrested in developing a model that predicts. Total aboveground biomass ‘tab’ based on ‘dbh’ and ‘tht’. our dependent variable is ‘tab’ and independent variables are ‘dbh’ and ‘tht’. Therefore, on the scatterplot, we plot ‘tab’ on y-axis and ‘dbh’, and ‘tht’ on the x-axis.
p1 = ggplot(data = df, aes(x=dbh, y=tab)) + geom_point() +
xlab("DBH(cm)") + ylab ("Total aboveground biomass (Kg)") +
theme_bw()
p1
p2 = ggplot(data = df, aes(x=tht, y=tab)) + geom_point() +
xlab("Total tree height (m)") + ylab ("Total aboveground biomass (Kg)") +
theme_bw()
p2
Now I will use R function ‘grid.arrange()’ within the ‘gridExtra’
library to combine these two plots into one. By specifying, ‘nrow=1’, I
am telling R that i want to to figure in one row.
library(gridExtra)
grid.arrange(p1, p2)
grid.arrange(p1, p2, nrow=1)
The reason for creating scatterplot is to determine if a linear model
will be appropriate. Additionally it can show us any unusal/ outlying
observation in the dataset. Here, we see that the relationship of dbh
and tht witht tab is not perfectly linear but approximately linear. we
will move forward with a linear model. specifically, we will fit a
multiple linear regression model that predicts tab (kg) as a function of
dbh (cm) and tht (m).
str(df)
## 'data.frame': 32 obs. of 9 variables:
## $ tree: int 4 5 12 28 33 34 57 64 65 76 ...
## $ dbh : num 45.8 23.1 54.9 36.7 58.5 43 59.6 45.7 36.1 42.5 ...
## $ tht : num 31.6 22.1 33.3 32.1 36 30.9 34.1 27.8 27.9 31.1 ...
## $ lcl : num 11.1 11.2 16.5 12.2 17.1 14.1 18.8 12.6 11.7 14.4 ...
## $ fol : int 37 5 52 23 71 28 85 28 30 40 ...
## $ bch : int 85 7 111 56 148 31 152 48 54 48 ...
## $ brk : int 164 32 241 115 294 140 282 129 99 137 ...
## $ stm : int 953 188 1401 665 1706 813 1636 749 574 797 ...
## $ tab : int 1239 232 1805 858 2220 1011 2154 954 757 1021 ...
summary(df)
## tree dbh tht lcl
## Min. : 4.00 Min. :23.10 Min. :20.10 Min. : 8.50
## 1st Qu.: 64.75 1st Qu.:34.42 1st Qu.:25.30 1st Qu.:12.07
## Median :120.50 Median :45.75 Median :30.75 Median :15.55
## Mean :108.88 Mean :45.85 Mean :29.09 Mean :15.80
## 3rd Qu.:153.50 3rd Qu.:55.80 3rd Qu.:32.12 3rd Qu.:18.73
## Max. :186.00 Max. :80.30 Max. :36.00 Max. :27.60
## fol bch brk stm
## Min. : 5.00 Min. : 7.00 Min. : 32.00 Min. : 188.0
## 1st Qu.: 16.75 1st Qu.: 31.75 1st Qu.: 69.75 1st Qu.: 402.5
## Median : 38.00 Median : 80.50 Median :138.50 Median : 805.0
## Mean : 44.38 Mean :100.59 Mean :162.44 Mean : 943.0
## 3rd Qu.: 70.25 3rd Qu.:151.25 3rd Qu.:214.00 3rd Qu.:1242.8
## Max. :107.00 Max. :347.00 Max. :434.00 Max. :2517.0
## tab
## Min. : 232.0
## 1st Qu.: 554.8
## Median :1034.5
## Mean :1250.3
## 3rd Qu.:1700.8
## Max. :3404.0
#Fit model
we use R function ‘lm()’ to fit a linear regression. Following line of code uses the dataset named ‘df’ and fits a multiple linear regression for tab as a function of dbh and tht.
mod1 = lm(tab ~ dbh + tht, data = df)
To see the results we use R function ‘summary()’ and include the name of the model inside the parenthesis as follows.
summary(mod1)
##
## Call:
## lm(formula = tab ~ dbh + tht, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -329.20 -102.31 -12.37 102.10 297.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1622.47 233.02 -6.963 1.18e-07 ***
## dbh 53.01 4.39 12.076 7.77e-13 ***
## tht 15.19 13.15 1.155 0.257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 171.9 on 29 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.9569
## F-statistic: 345.1 on 2 and 29 DF, p-value: < 2.2e-16
Here, we get the regression coefficients and many more information. Intercept of the regression model is denoted as (Intercept) and slopes are denoted by the name of independent variables (dbh and tht in our case). The \(R^2\) value is also printed as the output. The quantity Residual standard error is the square root of MSE (Mean Squared Error) which is sum of squared error divided by the degree of freedom for the error.
To get the fitted values for the modeling dataset we can use ‘fitted()’ function. Following code will create a new variable called ‘tab_pred’ using the fitted model. First five observations are then printed using ‘head()’ function and specifying ‘n=5’ within the ‘head()’ function.
df$tab_pred = fitted(mod1)
head(df, n=5)
## tree dbh tht lcl fol bch brk stm tab tab_pred
## 1 4 45.8 31.6 11.1 37 85 164 953 1239 1285.5926
## 2 5 23.1 22.1 11.2 5 7 32 188 232 -62.1133
## 3 12 54.9 33.3 16.5 52 111 241 1401 1805 1793.8277
## 4 28 36.7 32.1 12.2 23 56 115 665 858 810.7844
## 5 33 58.5 36.0 17.1 71 148 294 1706 2220 2025.6932
Residuals analysis is critical in developing regression models. Residuals are used to test model assumptions and to determine whether the model is appropriate for the given dataset. we can use ‘resid()’ or ‘residuals()’ function in R to get the residuals from the ‘lm()’ fit.
r1 = ggplot() +
aes(fitted(mod1), residuals(mod1))+
geom_point()+
theme_bw()+
xlab("Fitted values") + ylab("Residuals") +
ggtitle("Residuals vs fitted plot") +
theme(plot.title = element_text(hjust = 0.5))
r1
Ideally, we want the residuals to scatter around zero and not show any
discernible pattern. Here we see a curvilinear trend in the residual.
Residuals decrease for fitted values from zero to about 2000 kg and then
they increase. This a common problem in the biomass data. Usually, this
problem can be resolved using a logarithmic transformation on both
dependent and independent variables.
Let’s fit a logarithmic model i.e., our model will predict ‘ln(tab)’ as a function of ‘ln(dbh)’ and ‘ln(tht)’. First we will create new variables ln(dbh), ln(tht), and ln(tab) which are log transformations of dbh, tht, and tab, respectively. Then we will create scatterplot of these transformed variables and put in one figure.
df$lndbh = log(df$dbh)
df$lntht = log(df$tht)
df$lntab = log(df$tab)
p3 = ggplot(data = df, aes(x =lndbh, y =lntab)) + geom_point()+
theme_bw()
p4 = ggplot(data = df, aes(x =lntht, y =lntab)) + geom_point()+
theme_bw()
grid.arrange(p3, p4, nrow=1)
Now refit the model with log transformed data. We will still use the
‘lm()’ function, but use the log transformed variables in the model as
below:
mod2 = lm(lntab ~ lndbh + lntht, data = df)
summary(mod2)
##
## Call:
## lm(formula = lntab ~ lndbh + lntht, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.119720 -0.032165 0.005529 0.033067 0.097387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.97818 0.22338 -17.81 < 2e-16 ***
## lndbh 1.67213 0.06667 25.08 < 2e-16 ***
## lntht 1.35594 0.12331 11.00 7.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05417 on 29 degrees of freedom
## Multiple R-squared: 0.995, Adjusted R-squared: 0.9947
## F-statistic: 2908 on 2 and 29 DF, p-value: < 2.2e-16
Let’s extract the adjusted R-squared from both un-transformed and log-transformed model and compare them.
summary(mod1)$adj.r.squared #un-transformed model
## [1] 0.9569015
summary(mod2)$adj.r.squared #log-transformed model
## [1] 0.9946959
Adjusted R-squared is better for log-transformed model, but we are interested in looking at the residuals as well.
r2 = ggplot() +
aes(fitted(mod2), residuals(mod2)) +
geom_point() +
theme_bw() +
xlab("Fitted values") + ylab("Residuals") +
ggtitle("Residuals vs. fitted plot ~ transformed model") +
theme(plot.title = element_text(hjust = 0.5))
r2
It will be helpful to plot residuals from two models together for comparison. we can do that by using the same ‘grid.arrange()’ function.
grid.arrange(r1, r2)
we see that the residual plot of log-transformed model looks better than
the un-transformed model. However, we need to take note that this model
predicts ‘ln(tab)’ instead of ‘tab’. we can use ‘exp()’ function to back
transform the predicted value to original unit. However, this model will
have transformation bias. There are several bias correction
factors suggested to alleviate this. One common correction is to
multiply the back transformed value by \(exp(MSE/2)\), where MSE is the mean squared
error of the logarithmic model.
we will use linear regression model to predict aboveground biomass using dbh and tht with component biomass. The data contain, foliage (fol), Branch (bch), stem bark (brk), stem wood (stm), and total aboveground biomass(tab) along with diameter at breast height (dbh), live crown length (lcl), and total tree height (tht).
since we are not interested in component biomass models, we will plot each component biomass vs. dbh and tht
stm = ggplot(data = df, aes(x=dbh, y=stm)) +
geom_point() +
xlab("DBH (cm)") + ylab("Stemwood biomass (kg)")+
theme_bw()
brk = ggplot(data = df, aes(x=dbh, y=brk)) +
geom_point() +
xlab("DBH (cm)") + ylab("Stembark biomass (kg)")+
theme_bw()
bch = ggplot(data = df, aes(x=dbh, y=bch)) +
geom_point() +
xlab("DBH (cm)") + ylab("Branchwood biomass (kg)")+
theme_bw()
fol = ggplot(data = df, aes(x=dbh, y=fol)) +
geom_point() +
xlab("DBH (cm)") + ylab("Foliage biomass (kg)")+
theme_bw()
grid.arrange(stm, brk, bch, fol)
Now component biomass vs. total tree height
stm = ggplot(data = df, aes(x=tht, y=stm)) +
geom_point() +
xlab("Total tree height (m)") + ylab("Stemwood biomass (kg)")+
theme_bw()
brk = ggplot(data = df, aes(x=tht, y=brk)) +
geom_point() +
xlab("Total tree height (m)") + ylab("Stembark biomass (kg)")+
theme_bw()
bch = ggplot(data = df, aes(x=tht, y=bch)) +
geom_point() +
xlab("Total tree height (m)") + ylab("Branchwood biomass (kg)")+
theme_bw()
fol = ggplot(data = df, aes(x=tht, y=fol)) +
geom_point() +
xlab("Total tree height (m)") + ylab("Foliage biomass (kg)")+
theme_bw()
grid.arrange(stm, brk, bch, fol)
# Logarithmic transformation
df$lndbh = log(df$dbh)
df$lntht = log(df$tht)
df$lntab = log(df$tab)
df$lnlcl = log(df$lcl)
df$lnstm = log(df$stm)
df$lnbrk = log(df$brk)
df$lnbch = log(df$bch)
df$lnfol = log(df$fol)
There are two options when fitting component models. One is to fit them independently using ordinary least square (OLS) method and the second is to use seemingly unrelated regression (SUR). Since the data we use in component models come from same tree, it is common to use the SUR method which allows cross-equation correlation. In R, the SUR is available via ‘systemfit()’ package. Let’s first define our component models and create our system of equations.
I will also use ‘dbh’ and ‘lcl’ (length of live crown) as the predictors.
library("systemfit")
## Warning: package 'systemfit' was built under R version 4.4.2
## Loading required package: Matrix
## Loading required package: car
## Warning: package 'car' was built under R version 4.4.2
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Please cite the 'systemfit' package as:
## Arne Henningsen and Jeff D. Hamann (2007). systemfit: A Package for Estimating Systems of Simultaneous Equations in R. Journal of Statistical Software 23(4), 1-40. http://www.jstatsoft.org/v23/i04/.
##
## If you have questions, suggestions, or comments regarding the 'systemfit' package, please use a forum or 'tracker' at systemfit's R-Forge site:
## https://r-forge.r-project.org/projects/systemfit/
stm = lnstm ~ lndbh + lntht
brk = lnbrk ~ lndbh + lntht
bch = lnbch ~ lndbh + lntht
fol = lnfol ~ lndbh + lnlcl
system = list(Stemwood.Model = stm,
Stembark.Model = brk,
Branch.Model = bch,
Foliage.Model = fol)
Now, we can use ‘systemfit()’ function to fit. We can supply ‘OLS’ or ‘SUR’ options within ‘systemfit()’ to fit OLS or SUR model, respectively.
fitOLS = systemfit(system, "OLS", data =df)
summary(fitOLS)
##
## systemfit results
## method: OLS
##
## N DF SSR detRCov OLS-R2 McElroy-R2
## system 128 116 5.9824 0 0.929197 0.983908
##
## N DF SSR MSE RMSE R2 Adj R2
## Stemwood.Model 32 29 0.089725 0.003094 0.055623 0.994627 0.994257
## Stembark.Model 32 29 0.088241 0.003043 0.055162 0.994723 0.994359
## Branch.Model 32 29 3.647359 0.125771 0.354642 0.868544 0.859478
## Foliage.Model 32 29 2.157080 0.074382 0.272731 0.907522 0.901144
##
## The covariance matrix of the residuals
## Stemwood.Model Stembark.Model Branch.Model Foliage.Model
## Stemwood.Model 0.00309397 0.00305987 -0.00164037 0.00586529
## Stembark.Model 0.00305987 0.00304280 -0.00108351 0.00615333
## Branch.Model -0.00164037 -0.00108351 0.12577100 0.05529339
## Foliage.Model 0.00586529 0.00615333 0.05529339 0.07438206
##
## The correlations of the residuals
## Stemwood.Model Stembark.Model Branch.Model Foliage.Model
## Stemwood.Model 1.000000 0.9972593 -0.0831560 0.386632
## Stembark.Model 0.997259 1.0000000 -0.0553868 0.409016
## Branch.Model -0.083156 -0.0553868 1.0000000 0.571675
## Foliage.Model 0.386632 0.4090156 0.5716745 1.000000
##
##
## OLS estimates for 'Stemwood.Model' (equation 1)
## Model Formula: lnstm ~ lndbh + lntht
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.359970 0.229358 -19.0095 < 2.22e-16 ***
## lndbh 1.556075 0.068451 22.7327 < 2.22e-16 ***
## lntht 1.518033 0.126611 11.9898 9.2504e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.055623 on 29 degrees of freedom
## Number of observations: 32 Degrees of Freedom: 29
## SSR: 0.089725 MSE: 0.003094 Root MSE: 0.055623
## Multiple R-Squared: 0.994627 Adjusted R-Squared: 0.994257
##
##
## OLS estimates for 'Stembark.Model' (equation 2)
## Model Formula: lnbrk ~ lndbh + lntht
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.1258701 0.2274530 -26.9325 < 2.22e-16 ***
## lndbh 1.5576203 0.0678826 22.9458 < 2.22e-16 ***
## lntht 1.5183571 0.1255591 12.0928 7.514e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.055162 on 29 degrees of freedom
## Number of observations: 32 Degrees of Freedom: 29
## SSR: 0.088241 MSE: 0.003043 Root MSE: 0.055162
## Multiple R-Squared: 0.994723 Adjusted R-Squared: 0.994359
##
##
## OLS estimates for 'Branch.Model' (equation 3)
## Model Formula: lnbch ~ lndbh + lntht
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.575705 1.462331 -3.81289 0.0006633 ***
## lndbh 3.008349 0.436428 6.89312 1.4213e-07 ***
## lntht -0.462732 0.807239 -0.57323 0.5709092
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.354642 on 29 degrees of freedom
## Number of observations: 32 Degrees of Freedom: 29
## SSR: 3.647359 MSE: 0.125771 Root MSE: 0.354642
## Multiple R-Squared: 0.868544 Adjusted R-Squared: 0.859478
##
##
## OLS estimates for 'Foliage.Model' (equation 4)
## Model Formula: lnfol ~ lndbh + lnlcl
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.079375 0.587819 -10.34225 3.0590e-11 ***
## lndbh 1.732954 0.222173 7.80001 1.3324e-08 ***
## lnlcl 1.113205 0.229944 4.84121 3.9436e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.272731 on 29 degrees of freedom
## Number of observations: 32 Degrees of Freedom: 29
## SSR: 2.15708 MSE: 0.074382 Root MSE: 0.272731
## Multiple R-Squared: 0.907522 Adjusted R-Squared: 0.901144
The parameter estimates from this fit will be same as if we fitted four separate models using ‘lm()’ function. See example for foliage model:
fitfol = lm(lnfol ~ lndbh + lnlcl, data = df)
summary(fitfol)
##
## Call:
## lm(formula = lnfol ~ lndbh + lnlcl, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49041 -0.14550 0.01241 0.17623 0.52766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.0794 0.5878 -10.342 3.06e-11 ***
## lndbh 1.7330 0.2222 7.800 1.33e-08 ***
## lnlcl 1.1132 0.2299 4.841 3.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2727 on 29 degrees of freedom
## Multiple R-squared: 0.9075, Adjusted R-squared: 0.9011
## F-statistic: 142.3 on 2 and 29 DF, p-value: 1.018e-15