Trees generally increase in circumference as they age, so we can
assume a positive, roughly linear relationship between these two
variables. How quickly circumference grows with age may also differ
from tree to tree. To investigate this hypothesis, I chose the
Orange dataset in R.
The dataset contains three variables: Tree,
age (in days), and circumference (in mm). It records
five individual orange trees, each measured at seven ages, with the
trunk circumference noted at each measurement.
In total there are 35 observations (7 per tree), which is not a lot; nevertheless, I will use this dataset to look for a relationship between age and circumference.
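As a quick check of the description above, the structure of the dataset can be inspected directly. This is a minimal sketch that uses only base R; it assumes the copy of Orange that ships with the datasets package.

```r
# Orange ships with base R (package 'datasets'); no extra packages are needed.
data(Orange)

str(Orange)               # variable types; Tree is stored as an ordered factor
table(Orange$Tree)        # number of measurements per tree
levels(Orange$Tree)       # order of the factor levels (not 1, 2, 3, 4, 5)
sort(unique(Orange$age))  # the ages at which the trees were measured
```

The level order matters later on, because R treats ordered factors differently from unordered ones when they enter a regression model.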
The two variables in the Orange dataset that are relevant to this
question are circumference and age. Below I visualize their
relationship with a scatterplot.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
graph = Orange %>%
  ggplot(
    aes(
      x = age,
      y = circumference
    )
  ) +
  geom_point(
    size = 3,
    alpha = .27
  ) +
  labs(
    x = "Age",
    y = "Circumference",
    title = "Association between age and circumference of Orange trees"
  )
graph
The relationship between circumference and age appears to be positive, with some signs of non-linearity: circumference does not keep increasing at the same rate as age increases. Next, I will use ordinary least squares (OLS) to find a well-fitting regression model for the dataset.
To build the model and its diagnostic plots, I first loaded the
ggfortify package, which provides ggplot2-based visualization tools
for statistical analysis results. I fitted the model with the lm()
function, specifying the two variables in the formula and the dataset
they come from. Note that the formula age ~ circumference makes age the
response and circumference the predictor. I then produced residual
diagnostic plots with the autoplot() function.
library(ggfortify)
fit.1 = lm(
  formula = age ~ circumference,
  data = Orange
)
fit.1
##
## Call:
## lm(formula = age ~ circumference, data = Orange)
##
## Coefficients:
## (Intercept) circumference
## 16.604 7.816
autoplot(fit.1)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
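In an lm() formula, the variable to the left of ~ is the response, so age ~ circumference predicts age from circumference. The hypothesis is phrased the other way round (circumference growing with age), so it may help to see both directions side by side. The sketch below is an illustration only; the names fit.age and fit.circ are introduced here and are not part of the analysis above.

```r
# Same data, two directions of regression.
fit.age  <- lm(age ~ circumference, data = Orange)  # age as response (as in fit.1)
fit.circ <- lm(circumference ~ age, data = Orange)  # circumference as response

coef(fit.age)   # change in age per unit of circumference
coef(fit.circ)  # change in circumference per unit of age

# With a single predictor, both directions share the same R-squared,
# which equals the squared correlation between the two variables.
c(
  r2.age  = summary(fit.age)$r.squared,
  r2.circ = summary(fit.circ)$r.squared,
  cor.sq  = cor(Orange$age, Orange$circumference)^2
)
```

The slopes differ between the two fits because they are measured in different units, while the strength of the linear association is the same in both.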
The Residuals vs Fitted plot shows a clear curved pattern, which indicates non-linearity and supports the pattern seen in the scatterplot. I will therefore refit the model with a quadratic polynomial and check whether it fits better.
fit.2 <- lm(
  formula = age ~ poly(circumference, 2),
  data = Orange
)
autoplot(fit.2)
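Besides comparing residual plots by eye, the two nested models can be compared formally. The following is a minimal sketch using anova() and AIC() from base R; the object name comparison is introduced here for illustration.

```r
# Refit both models so the block is self-contained.
fit.1 <- lm(age ~ circumference, data = Orange)
fit.2 <- lm(age ~ poly(circumference, 2), data = Orange)

# F-test for the extra quadratic term (fit.1 is nested in fit.2).
comparison <- anova(fit.1, fit.2)
comparison

# Information criterion: lower AIC indicates the preferred model.
AIC(fit.1, fit.2)
```

A small p-value in the anova() table suggests that the quadratic term improves the fit beyond what chance alone would explain.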
The residuals are still not perfect, but the Residuals vs Fitted
plot suggests a better fit than the first model. To visualize a
quadratic fit, I reused the saved scatterplot object graph and added
the geom_smooth() function to draw a fitted curve over the association
between age and circumference of the Orange trees. Note that here the
formula y ~ poly(x, 2) regresses circumference (y) on age (x), so the
curve shows the reverse regression rather than fit.2 itself.
graph +
  geom_smooth(
    color = "#39FF14",
    method = "lm",
    formula = y ~ poly(x, 2)
  )
The plot shows that the curve fits well at the younger ages but less well at the oldest ages, where the points spread widely around the confidence band. Overall, the quadratic curve fits better than a straight line.
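Because geom_smooth() with formula = y ~ poly(x, 2) fits circumference as a function of age, the curve it draws is not the fit.2 model, which has age as the response. One possible way to draw fit.2 itself on the same axes is to predict age over a grid of circumference values and plot the result as a path. This is a sketch; grid and curve.plot are hypothetical names introduced here.

```r
library(ggplot2)

fit.2 <- lm(age ~ poly(circumference, 2), data = Orange)

# Grid of circumference values spanning the observed range.
grid <- data.frame(
  circumference = seq(
    min(Orange$circumference),
    max(Orange$circumference),
    length.out = 100
  )
)
# Predicted age for each circumference value in the grid.
grid$age <- predict(fit.2, newdata = grid)

# Same axes as the scatterplot: age on x, circumference on y.
# geom_path() connects the points in row order, i.e. by circumference.
curve.plot <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_point(size = 3, alpha = .27) +
  geom_path(data = grid, colour = "#39FF14", linewidth = 1)
curve.plot
```

The linewidth argument assumes ggplot2 3.4.0 or later.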
To examine the contribution of each term, I inspected the summary of the fitted model.
summary(fit.2)
##
## Call:
## lm(formula = age ~ poly(circumference, 2), data = Orange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -337.85 -94.63 6.62 117.14 382.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 922.14 30.02 30.719 < 2e-16 ***
## poly(circumference, 2)1 2620.01 177.59 14.753 7.92e-16 ***
## poly(circumference, 2)2 -593.27 177.59 -3.341 0.00214 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 177.6 on 32 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8696
## F-statistic: 114.4 on 2 and 32 DF, p-value: 2.637e-15
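Here the first-order term is positive and highly significant, so age and circumference are positively associated, as the scatterplot also shows. Because the model was fitted as age ~ poly(circumference, 2), age is the response and circumference is the predictor, so the coefficients describe how age changes with circumference rather than the effect of age on circumference. The negative and significant second-order term shows that the relationship is curved rather than a straight line: at larger circumferences, an additional unit of circumference corresponds to a smaller increase in age. To test directly whether growth in circumference slows after a certain age, circumference would have to be the response variable.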
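The coefficients reported for poly(circumference, 2) belong to orthogonal polynomial terms, so they are not slopes on the original measurement scale. As a sketch, the same quadratic curve can be refitted with raw terms, which gives coefficients in the original units; fit.2.raw is a name introduced here for illustration.

```r
# Orthogonal polynomial terms (as in the report).
fit.2 <- lm(age ~ poly(circumference, 2), data = Orange)

# Raw polynomial terms: same curve, coefficients on the original scale.
fit.2.raw <- lm(age ~ circumference + I(circumference^2), data = Orange)
coef(fit.2.raw)

# Both parameterizations give identical fitted values.
all.equal(unname(fitted(fit.2)), unname(fitted(fit.2.raw)))
```

The two versions differ only in how the curve is parameterized; the overall fit, residuals and R-squared are the same.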
Having examined the overall relationship, I wanted to know whether the individual Orange trees differ in how age and circumference are related. I therefore coloured the points in the scatterplot by tree.
graph.coloured = Orange %>%
  ggplot(
    aes(
      x = age,
      y = circumference,
      colour = factor(Tree)
    )
  ) +
  geom_point(
    size = 5,
    alpha = .35
  ) +
  scale_color_manual(
    values = c("darkred", "darkblue", "darkgreen", "purple", "yellow")
  ) +
  labs(
    x = "age",
    y = "circumference",
    title = "Association between circumference and age",
    color = "Tree number"
  ) +
  theme(legend.position = "bottom")
graph.coloured
Now it is visible which measurements belong to which tree. However, it is still hard to see whether the relationship between age and circumference differs between trees, so I filtered the data to tree number 1 to check whether it shows the same association.
Orange %>%
  filter(Tree == 1) %>%
  ggplot(
    aes(
      x = age,
      y = circumference
    )
  ) +
  geom_point()
The association looks linear at first, but after the third measurement it becomes non-linear. Next, I fit a model to all observations from the 5 trees, allowing the relationship to differ by tree, and inspect its summary.
To fit this model, I used the following syntax:
fit.final <- lm(
  formula = age ~ circumference * factor(Orange$Tree),
  data = Orange
)
summary(fit.final)
##
## Call:
## lm(formula = age ~ circumference * factor(Orange$Tree), data = Orange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116.850 -73.411 -6.803 71.967 132.009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -147.5247 38.7817 -3.804 0.000818 ***
## circumference 9.5421 0.3223 29.608 < 2e-16 ***
## factor(Orange$Tree).L 125.9317 87.1294 1.445 0.160782
## factor(Orange$Tree).Q -17.6314 85.2139 -0.207 0.837758
## factor(Orange$Tree).C -41.5741 89.3904 -0.465 0.645897
## factor(Orange$Tree)^4 116.5972 85.0693 1.371 0.182677
## circumference:factor(Orange$Tree).L -4.3836 0.7338 -5.974 3.08e-06 ***
## circumference:factor(Orange$Tree).Q 0.3017 0.7104 0.425 0.674716
## circumference:factor(Orange$Tree).C 1.0685 0.7415 1.441 0.161984
## circumference:factor(Orange$Tree)^4 -0.8279 0.6960 -1.189 0.245422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.44 on 25 degrees of freedom
## Multiple R-squared: 0.9735, Adjusted R-squared: 0.9639
## F-statistic: 101.9 on 9 and 25 DF, p-value: < 2.2e-16
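The labels .L, .Q, .C and ^4 in the output arise because Tree is an ordered factor, which R encodes with polynomial contrasts by default. These terms are linear, quadratic, cubic and quartic trends across the ordered levels, not individual trees. The sketch below shows the contrast matrix and, as one possible alternative, a treatment-coded fit with one coefficient per tree relative to a reference tree. Orange.unordered and fit.treat are names introduced here for illustration.

```r
# Tree is an ordered factor, so lm() uses polynomial contrasts for it.
is.ordered(Orange$Tree)
contrasts(Orange$Tree)  # columns .L, .Q, .C, ^4

# Alternative coding: drop the ordering so lm() uses treatment contrasts.
Orange.unordered <- transform(
  as.data.frame(Orange),
  Tree = factor(Tree, ordered = FALSE)
)
fit.treat <- lm(age ~ circumference * Tree, data = Orange.unordered)
coef(fit.treat)  # intercept and slope differences relative to the first level

# Both codings describe the same model and give the same fitted values.
fit.poly <- lm(age ~ circumference * Tree, data = Orange)
all.equal(unname(fitted(fit.poly)), unname(fitted(fit.treat)))
```

Inside a formula, the column can be written simply as Tree when data = Orange is supplied; writing Orange$Tree is not required.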
The model fits well, with an R-squared of just over 0.97. The main effect of circumference is highly significant: averaged across trees, each additional unit of circumference corresponds to about 9.5 additional units of age. Age and circumference are therefore positively associated. The standard error of this slope is small (0.3223) relative to the estimate, so the slope is estimated precisely. Because Tree is an ordered factor, R encodes it with polynomial contrasts, so the terms labelled .L, .Q, .C and ^4 are linear, quadratic, cubic and quartic trends across the ordered tree levels, not individual trees. For example, the cubic term has a standard error of 89.39, far larger than its estimate of -41.57, so it cannot be distinguished from zero. To get a sense of the range of plausible intercept and slope values, I also inspected the confidence intervals (CIs):
confint(fit.final)
## 2.5 % 97.5 %
## (Intercept) -227.3969927 -67.652308
## circumference 8.8783220 10.205810
## factor(Orange$Tree).L -53.5146184 305.378026
## factor(Orange$Tree).Q -193.1327115 157.869928
## factor(Orange$Tree).C -225.6769984 142.528845
## factor(Orange$Tree)^4 -58.6062343 291.800613
## circumference:factor(Orange$Tree).L -5.8948355 -2.872331
## circumference:factor(Orange$Tree).Q -1.1613508 1.764676
## circumference:factor(Orange$Tree).C -0.4586225 2.595688
## circumference:factor(Orange$Tree)^4 -2.2612535 0.605536
Circumference is a strong and precisely estimated predictor of age, with a tight 95% CI from 8.88 to 10.21. One interaction term, the linear contrast (circumference × Tree.L), is significant, with a CI entirely below zero (-5.89 to -2.87). This indicates that the slope relating age to circumference changes systematically across the ordered tree levels, i.e. the trees differ in their growth rates.
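Since the contrast coefficients are hard to read as growth rates, it can help to look at one slope per tree. The sketch below fits a separate straight line for each tree; per.tree is a name introduced here for illustration. Because the polynomial contrasts sum to zero across levels, the unweighted mean of these slopes should match the main-effect estimate for circumference in the interaction model.

```r
# One simple regression of age on circumference per tree.
per.tree <- sapply(
  split(as.data.frame(Orange), Orange$Tree),
  function(d) coef(lm(age ~ circumference, data = d))
)
round(t(per.tree), 2)  # rows: trees (in level order); columns: intercept, slope

# Unweighted mean slope across the five trees.
mean(per.tree["circumference", ])

# Main effect of circumference from the interaction model, for comparison.
fit.final <- lm(age ~ circumference * Tree, data = Orange)
coef(fit.final)["circumference"]
```

Trees with a smaller slope add circumference faster, since fewer days of age correspond to each additional unit of circumference.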
autoplot(fit.final)
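Whether the tree-specific slopes are needed at all can also be checked with a nested model comparison. This is a sketch under the assumption that an additive model (a common slope with tree-specific intercepts) is a sensible baseline; fit.additive, fit.interaction and interaction.test are names introduced here.

```r
# Common slope, tree-specific intercepts.
fit.additive <- lm(age ~ circumference + Tree, data = Orange)

# Tree-specific slopes and intercepts (same structure as fit.final).
fit.interaction <- lm(age ~ circumference * Tree, data = Orange)

# F-test on the four interaction terms taken together.
interaction.test <- anova(fit.additive, fit.interaction)
interaction.test
```

This tests all four interaction terms jointly, which avoids reading too much into any single contrast.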
The residual plots still show an imperfect fit; other ways of
handling the non-linearity could be tried to improve the model
further. Using the coloured plot saved earlier, we can add a fitted
line for each tree with the geom_smooth() function.
graph.coloured +
  geom_smooth(
    method = "lm"
  )
## `geom_smooth()` using formula = 'y ~ x'
To sum up, the age and circumference of Orange trees are positively associated. However, the association is not strictly linear: circumference does not keep increasing at a constant rate as age increases. After reaching a certain age, Orange trees continue to grow older, but their circumference does not increase each year as much as it did early in their lives. Even though the association is non-linear, circumference remains one of the visible properties of a tree from which its age can be estimated. My initial hypothesis was partially correct: age and circumference are positively related, but the relationship is not linear and it differs between trees. Working on my final assessment was interesting and helped me improve my skills in RStudio and R Markdown, as I got to practice what I learned throughout the course and apply it in my own way.