Introduction

Trees generally increase in circumference as they age. Depending on the type of tree, the way circumference grows with age can differ. To look for such patterns, I chose the Orange dataset in R to investigate my hypothesis.

The dataset contains 3 variables: Tree, age, and circumference. It records 5 orange trees, with the age of each tree at each measurement (in days) and its circumference at that age (in millimetres).

In total there are 35 observations for 5 trees (7 per tree), which is not a lot; however, I will try to look for a relationship between age and circumference using this dataset.
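The structure described above can be checked directly before plotting. The following is a minimal sketch; the name obs_per_tree is only an illustrative helper and is not part of the original analysis.

```r
# Orange ships with base R (package 'datasets'), so nothing needs installing.
data(Orange)

# Column types: Tree is an ordered factor, age and circumference are numeric.
str(Orange)

# Number of measurements recorded for each tree (hypothetical helper name).
obs_per_tree <- table(Orange$Tree)
obs_per_tree

# Span of the two numeric variables.
range(Orange$age)
range(Orange$circumference)
```

Every tree is measured at the same set of ages, which is why the counts per tree are equal.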

Visualization

The two variables relevant to this problem in the Orange dataset are circumference and age. Below I visualize their relationship with a scatterplot.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
graph = Orange %>% 
  ggplot(
    aes(
      x = age,
      y = circumference
    )
  )+
  geom_point(
    size = 3,
    alpha = .27
  )+
  labs(
    x = "Age",
    y = "Circumference",
    title = "Association between age and circumference of Orange trees"
  )

graph

The relationship between circumference and age appears to be positive, with some signs of non-linearity: circumference does not keep increasing at the same rate as the tree gets older. Next, I will use ordinary least squares (OLS) to find a best-fitting regression model for the dataset.

Model fit

First, to construct the model and its diagnostic plots, I loaded the ggfortify package, which provides data visualization tools for statistical analysis results. To fit the model I used the lm() function, giving it a formula with the two variables whose relationship I want to estimate and the dataset they come from. Note that the formula age ~ circumference makes age the response and circumference the predictor. I then used the autoplot() function to produce the residual plots.

library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.5.2
fit.1 = lm(
  formula = age ~ circumference,
  data = Orange
)

fit.1
## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Coefficients:
##   (Intercept)  circumference  
##        16.604          7.816
autoplot(fit.1)

The residuals vs fitted plot makes it very obvious that there are issues with non-linearity, which supports my earlier point about the pattern in the scatterplot. I therefore refit the model using a quadratic polynomial and checked whether this was a better fit.

fit.2 <- lm(
  formula = age ~ poly(circumference,2),
  data = Orange
)

autoplot(fit.2)

Though the residuals are not perfect, the residual plots suggest a better result than the first fit. To visualize a quadratic fit, I took the saved scatterplot and added geom_smooth() to draw a fitted curve through the association between age and circumference of the orange trees. Because the plot has age on the x-axis, this curve regresses circumference on age, the reverse direction of fit.2.

graph + 
  geom_smooth(
    color = "#39FF14",
    method = "lm",
    formula = y ~ poly(x,2)
  )

The plot shows that the curve fits well at the early ages, but the fit worsens towards the oldest ages. The points are spread more unevenly around the confidence band at the upper end. Overall, it is a better-fitting model.
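The claim that the quadratic model fits better can also be checked numerically rather than only by eye. The sketch below refits both models under the illustrative names m_lin and m_quad (assumptions, so that the saved fits are not overwritten) and compares them with a nested F-test and AIC.

```r
# Refit the linear and quadratic models under illustrative names.
m_lin  <- lm(age ~ circumference, data = Orange)
m_quad <- lm(age ~ poly(circumference, 2), data = Orange)

# Nested F-test: does adding the quadratic term significantly reduce
# the residual sum of squares?
cmp <- anova(m_lin, m_quad)
cmp

# AIC penalises the extra parameter; the lower value is preferred.
AIC(m_lin, m_quad)
```

Because the two models differ by a single term, the F-test here is equivalent to the t-test on the quadratic coefficient in the model summary.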

To investigate the influence of each coefficient, I inspected the summary of the fitted model.

summary(fit.2)
## 
## Call:
## lm(formula = age ~ poly(circumference, 2), data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -337.85  -94.63    6.62  117.14  382.94 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               922.14      30.02  30.719  < 2e-16 ***
## poly(circumference, 2)1  2620.01     177.59  14.753 7.92e-16 ***
## poly(circumference, 2)2  -593.27     177.59  -3.341  0.00214 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 177.6 on 32 degrees of freedom
## Multiple R-squared:  0.8773, Adjusted R-squared:  0.8696 
## F-statistic: 114.4 on 2 and 32 DF,  p-value: 2.637e-15

Here we see a positive and significant linear term: larger circumference goes with greater age, which matches the upward trend in the scatterplot. Since this model regresses age on circumference, age is the dependent variable and circumference the independent one. The negative quadratic term, also significant, indicates that the fitted curve bends downward: at larger circumferences, predicted age rises more slowly for each additional unit of circumference. Together the two terms explain about 88% of the variance in age (R-squared 0.8773).
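One caveat when reading these coefficients: by default poly() builds orthogonal polynomial terms, so the estimates are not on the scale of circumference and circumference squared. As a sketch, the same curve can be refit with raw = TRUE to get coefficients on the original scale; the names fit.orth, fit.raw and same_fit are illustrative assumptions.

```r
# Orthogonal polynomial terms (the default, as used in the analysis above).
fit.orth <- lm(age ~ poly(circumference, 2), data = Orange)

# Raw polynomial terms: coefficients multiply circumference and circumference^2.
fit.raw <- lm(age ~ poly(circumference, 2, raw = TRUE), data = Orange)

# Both parameterisations describe the same fitted curve.
same_fit <- all.equal(unname(fitted(fit.orth)), unname(fitted(fit.raw)))
same_fit

# Coefficients on the original measurement scale.
coef(fit.raw)
```

The orthogonal form is numerically more stable and keeps the linear and quadratic tests independent of each other, while the raw form is easier to write out as an equation.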

Additional exploration

After finding the overall relationship between the variables, I wanted to know whether the individual orange trees differ in how circumference relates to age. I therefore coloured the points by tree to distinguish the different trees in the dataset.

graph.coloured = Orange %>% 
  ggplot(
    aes(
      x = age,
      y = circumference,
      colour = factor(Tree)
    )
  )+
  geom_point(
    size = 5,
    alpha = .5
  )+
  scale_color_manual(
    values = c("darkred","darkblue","darkgreen","purple","yellow")
  )+
  labs(
    x = "age",
    y = "circumference",
    title = "Association between circumference and age",
    color = "Tree number"
  )+
  theme(legend.position = "bottom")

graph.coloured

Now it is visible which tree had which circumference at each age. But it is still hard to see whether the effect of age on circumference differs by tree, so I filtered the data to orange tree number 1 to see if it shows the same association. Note that this plot puts circumference on the x-axis and age on the y-axis.

Orange %>% 
  filter(Tree == 1) %>% 
  ggplot(
    aes(
      x = circumference,
      y = age
    )
  )+
  geom_point()

The association seems linear at first, but after the third observation the points depart from a straight line. Next, I fit a model to the observations of all 5 trees and summarize it to check the relationship.
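Rather than filtering one tree at a time, all five trees can be shown side by side. This is a minimal sketch using facet_wrap(); the object name facet_plot is illustrative, and the axes follow the first scatterplot (age on x, circumference on y) rather than the single-tree plot above.

```r
library(ggplot2)

# One panel per tree; points joined by a line to show each growth path.
facet_plot <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ Tree) +
  labs(x = "Age", y = "Circumference")

facet_plot
```

Panels appear in the order of the factor levels of Tree, which is not the numeric order 1 to 5.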

Final model

To fit this model, I used the following syntax:

fit.final <- lm(
  formula = age ~ circumference * factor(Orange$Tree),
  data = Orange
)

#### Summarise Model ####
summary(fit.final)
## 
## Call:
## lm(formula = age ~ circumference * factor(Orange$Tree), data = Orange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.850  -73.411   -6.803   71.967  132.009 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -147.5247    38.7817  -3.804 0.000818 ***
## circumference                          9.5421     0.3223  29.608  < 2e-16 ***
## factor(Orange$Tree).L                125.9317    87.1294   1.445 0.160782    
## factor(Orange$Tree).Q                -17.6314    85.2139  -0.207 0.837758    
## factor(Orange$Tree).C                -41.5741    89.3904  -0.465 0.645897    
## factor(Orange$Tree)^4                116.5972    85.0693   1.371 0.182677    
## circumference:factor(Orange$Tree).L   -4.3836     0.7338  -5.974 3.08e-06 ***
## circumference:factor(Orange$Tree).Q    0.3017     0.7104   0.425 0.674716    
## circumference:factor(Orange$Tree).C    1.0685     0.7415   1.441 0.161984    
## circumference:factor(Orange$Tree)^4   -0.8279     0.6960  -1.189 0.245422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.44 on 25 degrees of freedom
## Multiple R-squared:  0.9735, Adjusted R-squared:  0.9639 
## F-statistic: 101.9 on 9 and 25 DF,  p-value: < 2.2e-16

The model fits very well, with an R-squared of just over 97%. The main effect of circumference is highly significant, showing that age and circumference increase together. On average across the five trees, each additional unit of circumference corresponds to approximately 9.5 more units of age. The standard error of this slope is low at 0.3223, so it is estimated precisely. Because Tree is an ordered factor, R codes it with polynomial contrasts: the .L, .Q, .C and ^4 terms are linear, quadratic, cubic and quartic trends across the ordered trees, not individual trees. The cubic contrast, for example, has a large standard error of 89.39 and is not significant. To get a sense of the range of possible intercept/slope values, I also inspected the CIs:

confint(fit.final)
##                                            2.5 %     97.5 %
## (Intercept)                         -227.3969927 -67.652308
## circumference                          8.8783220  10.205810
## factor(Orange$Tree).L                -53.5146184 305.378026
## factor(Orange$Tree).Q               -193.1327115 157.869928
## factor(Orange$Tree).C               -225.6769984 142.528845
## factor(Orange$Tree)^4                -58.6062343 291.800613
## circumference:factor(Orange$Tree).L   -5.8948355  -2.872331
## circumference:factor(Orange$Tree).Q   -1.1613508   1.764676
## circumference:factor(Orange$Tree).C   -0.4586225   2.595688
## circumference:factor(Orange$Tree)^4   -2.2612535   0.605536

Circumference is a strong and precise predictor of age, with a tight CI between 8.88 and 10.21. One interaction, the linear contrast (circumference × Tree.L), is significant, with a CI entirely below zero (-5.89 to -2.87). This indicates that the slope of age on circumference changes systematically across the ordered trees, so the trees differ considerably in growth rate.
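Since the contrast terms do not correspond to individual trees, it can help to look at the slope for each tree directly. The sketch below, under the assumption that a separate simple regression per tree is an acceptable way to read the interaction, prints the level order and contrast matrix and then estimates one slope per tree; per_tree_slope is an illustrative name.

```r
# Tree is an ordered factor; its levels are not in numeric order.
levels(Orange$Tree)

# Polynomial contrasts that R applies to an ordered factor with 5 levels.
contrasts(Orange$Tree)

# Slope of age on circumference, estimated separately for each tree.
per_tree_slope <- sapply(levels(Orange$Tree), function(tr) {
  fit <- lm(age ~ circumference, data = Orange[Orange$Tree == tr, ])
  coef(fit)[["circumference"]]
})
per_tree_slope

# The contrast columns sum to zero, so the circumference main effect in the
# interaction model is the unweighted mean of these per-tree slopes.
mean(per_tree_slope)
```

Reading the slopes in level order shows the direction of the significant linear contrast: the slope of age on circumference is steeper for the first level than for the last.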

autoplot(fit.final)

The residual plots are still not perfect; other ways of handling non-linearity could be used to improve the fit further. Using the saved coloured plot, we can see a fitted line for each tree below by adding geom_smooth(). These lines regress circumference on age separately for each tree.

graph.coloured +
  geom_smooth(
    method = "lm"
  )
## `geom_smooth()` using formula = 'y ~ x'

Conclusion