Simple Linear Regression for Linear Relationship

The overall purpose of this exercise is to ease myself into linear regression. I also try out some intermediate-to-advanced methods along the way. There may well be some overuse of statistical methods here, but I just want to put to use some of the things I’ve been learning, and this exercise allowed me to do just that. Throughout the analysis, I detect autocorrelation of the residuals and heteroscedasticity while working through a seven-item list of regression assumptions. I actually chose a simple data set for this exercise because, well, I didn’t think it would be this long! However, I started having fun doing it, so I thought, why not keep going? I apologize ahead of time for the extensiveness, but I think that’s what data is all about: diving deep! Hope you enjoy reading!

 # Load packages
#install.packages("tidyverse")
#install.packages("DT")
#install.packages("lmtest")
#install.packages("plotly")
library(tidyverse)   # dplyr, tidyr and ggplot2 functions used throughout
library(DT)          # datatable()
library(lmtest)      # bptest(), dwtest()
library(plotly)      # ggplotly()

Find Data

?Orange
## starting httpd help server ... done
 # The Orange data frame has sample data records of the growth
 # of orange trees. If you're a fan of OJ, you might want to
 # stick around for this!

View Data

glimpse(Orange)
## Rows: 35
## Columns: 3
## $ Tree          <ord> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ age           <dbl> 118, 484, 664, 1004, 1231, 1372, 1582, 118, 484, 664, 10…
## $ circumference <dbl> 30, 58, 87, 115, 120, 142, 145, 33, 69, 111, 156, 172, 2…
View(Orange)

Orange |> datatable(options = list(pageLength = 9))

Ask Questions

 # Question 1
 # Does the age of an Orange Tree have a correlation to the
 # overall size of an Orange Tree?

 # Question 2
 # Which Orange Tree has the largest circumference
 # on average?

 # Question 3
 # Are the differences in Orange Tree circumference means
 # statistically significant?

Assign data to alias object.

df <- Orange

 # Get a feel for the data structure.
glimpse(df)
## Rows: 35
## Columns: 3
## $ Tree          <ord> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ age           <dbl> 118, 484, 664, 1004, 1231, 1372, 1582, 118, 484, 664, 10…
## $ circumference <dbl> 30, 58, 87, 115, 120, 142, 145, 33, 69, 111, 156, 172, 2…
str(df)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   35 obs. of  3 variables:
##  $ Tree         : Ord.factor w/ 5 levels "3"<"1"<"5"<"2"<..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ age          : num  118 484 664 1004 1231 ...
##  $ circumference: num  30 58 87 115 120 142 145 33 69 111 ...
##  - attr(*, "formula")=Class 'formula'  language circumference ~ age | Tree
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time since December 31, 1968"
##   ..$ y: chr "Trunk circumference"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(mm)"

Transform data types accordingly.

df$age <- as.integer(df$age)

class(df$age)
## [1] "integer"
 # Assign new data to new object.
df1 <- df

Re-code Tree variable.

 # The Tree factor is ordered (by maximum circumference), but that
 # ordering isn't needed here, so I'll remove the ordinal classification.
df1$Tree <- factor(df1$Tree, order = FALSE,
                   levels = c("1", "2", "3", "4", "5"))

 # Check levels and class of variables.
levels(df1$Tree)
## [1] "1" "2" "3" "4" "5"
 # Re-code levels of Tree variable for clearer
 # categorization.
df1 <- df1 |> 
  mutate(Tree = recode(Tree, "1" = "Tree A",
                       "2" = "Tree B", "3" = "Tree C",
                       "4" = "Tree D", "5" = "Tree E"))

View(df1)

df1 |> datatable(options = list(pageLength = 9))

Normalize column title schema.

# In practice, lowercase column names may be preferable; I'll use
# title case here, keeping in mind that R is case-sensitive. In SQL,
# for example, I'd go with lowercase titles.
df1 <- df1 |> 
  rename("Age" = "age")

df1 <- df1 |> 
  rename("Circumference" = "circumference")

View(df1)

df1 |> datatable(options = list(pageLength = 3))
 # Data looks good with respect to pre-processing.
 # Since the purpose of this exercise is statistical
 # practice, it won't be a typical EDA process. Instead,
 # I'll go directly into summarization with stats,
 # visualization and analysis.

Double check data types for calculations.

sapply(df1, class)
##          Tree           Age Circumference 
##      "factor"     "integer"     "numeric"

Ask more questions.

 # Question 4
 # Do Orange Trees continue to get larger circumferences
 # as they get older?

 # Question 5
 # Does this phenomenon reverse at any particular point
 # in time throughout an Orange Tree's lifespan?

Run Pearson’s Correlation test.

 # First, I'll run a Pearson Correlation test to briefly
 # evaluate the association between Age and Circumference.
 
 # Check for normality of variables
qqnorm(df1$Circumference)

qqline(df1$Circumference)

qqnorm(df1$Age)

qqline(df1$Age)

 # Both variables look approximately normally
 # distributed on the QQ-plots.
 # I'll run cor.test().
?cor.test
 # There are several assumptions that generally must
 # be met in order to validate a Pearson's Correlation;
 # I'll omit these requirements as I'm only using the test
 # to get a sense of the potential relationship between the
 # variables I've already hypothesized about.
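As a quick sanity check that doesn't rely on normality, a rank-based Spearman correlation could be run alongside it (a sketch, not part of the original write-up):

 # Spearman's rank correlation makes no normality assumption;
 # a warning about tied ranks is likely with this data.
cor.test(df1$Age, df1$Circumference, method = "spearman")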
PearsonCorr <- cor.test(df1$Age, df1$Circumference)

PearsonCorr
## 
##  Pearson's product-moment correlation
## 
## data:  df1$Age and df1$Circumference
## t = 12.9, df = 33, p-value = 1.931e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8342364 0.9557955
## sample estimates:
##       cor 
## 0.9135189
 # There seems to be a significant, strong, linear
 # correlation between tree age and tree circumference.
 # I'll start a regression model to visualize and
 # examine the linear relationship further.
 
 # I need a Linear regression summary to analyze
 # the impact that tree age has on tree size, but first
 # I'll check that assumptions for linear regression 
 # are fully met.

Linear Regression Assumptions

Data must have:

1.) 10 or more observations per independent variable

2.) No multi-collinearity or autocorrelation in independent/predictor variables

3.) Clear linearity of outcome and predictor variables

4.) Dependent variable and residuals follow a normal distribution

5.) No correlation of independent variables and residuals

6.) Homoscedasticity/homogeneity of variance in residuals

7.) Independence of residuals (no residual autocorrelation)

Assumption 1

-10 or more observations per independent variable

length(df1$Age)
## [1] 35
length(df1$Circumference)
## [1] 35
 # There are a total of 35 observations each.
 # Assumption 1 has been met.

Assumption 2

-No multicollinearity or autocorrelation of independent/predictor variables

 # Since this is not Multiple Regression I only
 # have 1 independent variable. Therefore, there can
 # be no multicollinearity.
 # Assumption 2 has been met.
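For reference, if this were a multiple regression, multicollinearity could be checked with variance inflation factors. A minimal sketch using a hypothetical model that adds Tree as a second predictor (not part of this analysis):

 # car::vif() flags collinear predictors (values well above ~5-10 are a concern);
 # with a factor predictor it reports generalized VIFs (GVIF).
multi_lm <- lm(Circumference ~ Age + Tree, data = df1)
car::vif(multi_lm)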

Assumption 3

-Clear linearity of outcome and predictor variables

plot(Circumference ~ Age, data = df1 )

 # The relationship looks fairly linear;
 # I'm a few steps from being able to produce my linear model!
 # Assumption 3 has been met.

Assumption 4

-Dependent variable and residuals both follow a normal distribution

 # Check dependent variable for normality
circ <- df1$Circumference

hist(circ)

 # Outcome/Dependent/Y variable seems roughly bell shaped.
qqnorm(circ)

qqline(circ)

 # The dependent variable tracks the QQ-line. There seem
 # to be a few outliers; I'll look further into this soon.
 # For now, I'll consider this normally distributed.
 # In order to check distribution of residuals, I need to
 # produce my linear model!

Fit linear model; interpret results.

Age_lm <- lm(Circumference ~ Age, data = df1 )
 
Age_lm
## 
## Call:
## lm(formula = Circumference ~ Age, data = df1)
## 
## Coefficients:
## (Intercept)          Age  
##     17.3997       0.1068
summary(Age_lm)
## 
## Call:
## lm(formula = Circumference ~ Age, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -14.946  -0.076  19.697  45.111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.399650   8.622660   2.018   0.0518 .  
## Age          0.106770   0.008277  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Check model fit and residuals.

 # Check to see if the residuals are normally distributed
 # and the model is actually a good fit for the data. In addition,
 # see to it that there isn't too large a variation
 # in the model residuals.
plot.new()

attributes(Age_lm)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
par(mfrow=c(2,2))

Age_lm$residuals |> hist()

plot(density(Age_lm$residuals))
 
 # The histogram and density plot of the residuals look
 # very close to normally distributed.
 # I'll check the residuals in a QQ-plot next.
qqnorm(Age_lm$residuals)

qqline(Age_lm$residuals)
 
 # The residuals seem to align closely with the QQ-line.
 # Taking this plot into account along with the
 # roughly normal histogram, I'll hereby
 # conclude the residuals to be normally distributed.
 # Assumption 4 has been met!
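As a numeric complement to the QQ-plot (a quick sketch, not in the original write-up), a Shapiro-Wilk test can be applied to the residuals:

 # Shapiro-Wilk normality test; a p-value above 0.05 is
 # consistent with normally distributed residuals.
shapiro.test(Age_lm$residuals)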

Interpreting first lines of model results.

With the first 4 assumptions met, I’ll proceed and begin interpreting the first lines of the model results.

summary(Age_lm)
## 
## Call:
## lm(formula = Circumference ~ Age, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -14.946  -0.076  19.697  45.111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.399650   8.622660   2.018   0.0518 .  
## Age          0.106770   0.008277  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14
 # Again, residuals are normally distributed.
 # As for the coefficient results...
 # Firstly, and most importantly, I see that the
 # p-value for Age is far below 5% (p < 0.001)!
 # This says that the probability of observing a t statistic
 # this extreme, if the null hypothesis H0 (that there
 # is no relationship) were true, is extremely low
 # and unlikely. Therefore, I can accept the alternative:
 # Age's association with Circumference is statistically
 # significant. The R-squared of 0.83 also tells me the
 # model fits the data well.
 
 # The Estimate shows the effect of (x) Age on
 # (y) Circumference is 0.106770. There's a
 # significant, positive relationship between Age and
 # Circumference, with roughly a 0.107 mm increase in
 # circumference for every one-day increase in age.
 
 # I'll move on to the 5th assumption before interpreting
 # the rest of the residual diagnostics, as they may be
 # invalid if the homoscedasticity assumption doesn't hold.
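To make the coefficients concrete, here's a quick worked example (a sketch; the 1000-day age is an arbitrary value I chose for illustration):

 # Fitted equation: Circumference ≈ 17.40 + 0.1068 * Age
 # e.g., at Age = 1000 days: 17.4 + 106.8 ≈ 124 mm.
predict(Age_lm, newdata = data.frame(Age = 1000))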

Assumption 5

-No correlation of independent variables and residuals

par(mfrow=c(1,1))
 
plot(Age_lm$residuals, df1$Age)

 # There is no clear linearity or correlation
 # between the independent variable Age and the
 # model residuals.
 # Assumption 5 has been met.

Assumption 6

-Homoscedasticity/homogeneity of variance in residuals

 # Check to see whether the residuals indicate a good fit,
 # and that there isn't too large a variation in the model
 # error.
 
 # Plot Residuals against fitted values
 cbind(Fitted = fitted(Age_lm),
       Residuals=residuals(Age_lm)) |> 
   as.data.frame() |> 
   ggplot(aes(x=Fitted, y=Residuals)) + 
   geom_point() + theme_classic()

 # Add 0 line
 plot(df1$Age, Age_lm$residuals)
 abline(h=0)

 # I do notice a fanning or dispersing of the points.
 # A pattern like this suggests there may be
 # heteroscedasticity in the errors, meaning the
 # variance of the residuals may not be constant.
 
 # I guess Data Science isn't all as glamorous as they
 # make it seem! Still, I'm the man for the job.
 # I'll investigate this further with more residual
 # diagnostic plots.
 
par(mfrow=c(2,2))

plot(Age_lm)

par(mfrow=c(1,1))

# The Q-Q residual plot looks approximately normal.

# Residuals vs Fitted shows a bit of heteroscedasticity,
# yet the moving average (red line) is centered around zero, which
# is a good sign.

# The Scale-Location plot suggests there's definitely
# an issue with constant residual variance: the red line
# trends upward with the fitted values.

# I can run a Breusch-Pagan test
# to check for homoscedasticity more formally!
bptest(Age_lm)
## 
##  studentized Breusch-Pagan test
## 
## data:  Age_lm
## BP = 11.228, df = 1, p-value = 0.0008056
# This confirms heteroscedasticity.
# With a p-value of 0.0008056, there is strong
# evidence of heteroscedasticity, as suspected.

# Although I discovered that the residual variance is not
# constant, I'll continue with this analysis for
# the purpose of seeing if I can overlook
# this assumption, considering all others have
# been met.
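For reference, one common workaround that I'm not applying here would be to keep the OLS fit but report heteroscedasticity-consistent (robust) standard errors. A minimal sketch, assuming the sandwich package is installed:

#install.packages("sandwich")
library(sandwich)   # vcovHC() robust covariance estimators
 # coeftest() comes from lmtest, which is already loaded.
coeftest(Age_lm, vcov. = vcovHC(Age_lm, type = "HC3"))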

Assumption 7

-Independence of residuals(no residual autocorrelation)

# I can perform a Durbin-Watson test to check for
# autocorrelation of the residuals in the
# linear regression model.
library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(Age_lm)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.6629058     0.6695267       0
##  Alternative hypothesis: rho != 0
# The DW test implies there is autocorrelation
# in the residuals.

# The output shows a test statistic
# of 0.6695267 with a p-value of 0.
# Since the p-value is less than 0.05, I must reject
# the null hypothesis that there is no autocorrelation
# and accept the alternative: the residuals
# in the model are autocorrelated.

# To double-check,
# I'll run a different version of the test.
dwtest(formula = Age_lm,  alternative = "two.sided")
## 
##  Durbin-Watson test
## 
## data:  Age_lm
## DW = 0.66953, p-value = 4.785e-06
## alternative hypothesis: true autocorrelation is not 0
# The same result is returned here, with a more precisely
# reported p-value and a slightly less precise test statistic.
# I'll hereby conclude that there is indeed autocorrelation in
# the residuals of this model.

# The forecasts from a model with autocorrelated errors
# are still unbiased, and aren't necessarily "wrong",
# but they will usually have larger prediction intervals
# than they need to.
# This linear model will not be used for predictions,
# and is mainly being used to establish statistical
# confirmation of a strong correlation between the age
# and circumference of Orange Trees.

# While in a predictive model this would require a remedy,
# I think I can work around it for the purpose of
# this exercise.
plot(Age_lm$residuals, type = "l")

?acf
acf(Age_lm$residuals)

# There is a clear, slowly decaying pattern in the ACF
# and visible runs in the residual line plot.
# I could look into transformations or models that
# correct for the autocorrelation (a brief sketch follows),
# but for the sake of shortening this exercise I'll move on.
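For reference, much of this autocorrelation comes from repeated measurements on the same five trees. A minimal sketch of one way to account for that, using a mixed-effects model with a random intercept per tree (nlme ships with R; this model is not pursued further here):

 # A random intercept per Tree absorbs the within-tree
 # correlation that the ordinary lm() ignores.
Age_lme <- nlme::lme(Circumference ~ Age, random = ~ 1 | Tree, data = df1)
summary(Age_lme)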

Remedy assumption violations.

I’ll move forward with the analysis and forgo any variable transformations to compensate for the non-constant residual variance. I think that’ll be fine for the purpose of this analysis.

# I'm overlooking the autocorrelation and heteroscedasticity
# in the residuals for 3 main reasons.

# 1st reason
# The residuals are clearly close to normal, as demonstrated
# in the QQ-Plot.

# 2nd reason
# The red line representing the mean of the residuals
# on the fitted-values plot is fairly horizontal
# and centered around zero. That suggests outliers and
# biases significant to this analysis
# are less likely.

# 3rd reason
# These assumptions matter most when a model is used for
# prediction or forecasting.
# This model will not be used for predictions; this is a
# simple regression mainly for correlation analysis.

# For the purposes of this analysis, I'll treat assumption 6
# (homoscedasticity) and assumption 7 (no residual
# autocorrelation) as acceptably handled.
# Finally, I'll plot the regression model!

Plot the fitted linear regression model.

#plot.new()

Age_lm_plot <- 
  ggplot(df1, aes( x = Age, y = Circumference )) + 
  geom_point() +
  geom_smooth( method ="lm", color = "cyan3" ) +
  labs( x = "Age of Tree", y = "Tree Circumference", title = "Linear Regression \n of Age & Circumference" ) +
  theme(plot.title = element_text( hjust = 0.5, size = 15, face='bold')) +
  theme( axis.title.x = element_text(color = "black", size = 11, face = "bold"),
         axis.title.y = element_text(size = 11, face = "bold"),
         panel.background = element_rect(fill = "white", colour = "lightgray", size = .5, linetype = "solid"))
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Age_lm_plot
## `geom_smooth()` using formula = 'y ~ x'

 # The model shows a strong, positive, linear correlation 
 # between Age and the Circumference of Orange Trees!
 # It seems the older an Orange Tree is, the larger
 # its circumference is.

 # This makes sense: the older a tree is, the more
 # time it has had to grow.

Re-evaluate shape and structure of data.

dim(df1)
## [1] 35  3
View(df1)

df1 |> datatable(options = list(pageLength = 9))
str(df1)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   35 obs. of  3 variables:
##  $ Tree         : Factor w/ 5 levels "Tree A","Tree B",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Age          : int  118 484 664 1004 1231 1372 1582 118 484 664 ...
##  $ Circumference: num  30 58 87 115 120 142 145 33 69 111 ...

Group and summarize data

# Group data for summary and calculate mean
 # circumference for each of the 5 trees.
 Grouped_Average <- df1 |>  group_by(Tree) |> 
   summarise(Average = mean(Circumference)) |> 
   arrange(desc(Average))
 
 View(Grouped_Average)
 
 Grouped_Average |> datatable(options = list(pageLength = 5))
 # Tree D had the largest circumference on average.
 # Tree B had the 2nd largest circumference on average.
 # Tree E had the 3rd largest circumference on average.
 
 # I do find it odd that Tree C's mean comes out
 # to a perfectly round integer.
 
 # I'll investigate this further.

Double-check mean of Tree C.

# Perform the mean calculation manually.
 df1 |>
   group_by(Tree) |>
   summarise(sum = sum(Circumference)) |>
   View()
 
 df1 |>
   group_by(Tree) |>
   summarise(mean2 = sum(Circumference)/7) |>
   View()
 
 658/7
## [1] 94
 # Everything turns out to be fine! The true mean
 # of Tree C really is exactly 94.0; the round
 # number is an accurate calculation after all.
 
 # Now, I'll visualize this finding.
 
 # I'll start some line charts to get a visual of the tree
 # size comparisons
 
 ggplot(df1, aes(x=Age, y=Circumference, color=Tree ) ) +
   geom_line()

 # I can see Tree D is the largest on average!
 # Tree C is the smallest; I'll highlight this to make
 # insights clearer. 
 
 ggplot(df1, aes(x=Age, y=Circumference, color=Tree ) ) + 
   geom_line(size=1.5) +
   scale_color_manual(values=c("lightgrey", "lightgrey", "black", "cyan3", "lightgrey")) +
   xlim(100, 1600) +
   labs( title = "Line Graph of Average Tree Size" ) +
   theme(plot.title = element_text( hjust = 0.5, size = 15, face='bold')) +
   theme( axis.title.x = element_text(color = "black", size = 11, face = "bold"),
          axis.title.y = element_text(size = 11, face = "bold"),
          panel.background = element_rect(fill = "white", colour = "lightgray", size = .5, linetype = "solid"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

 # The line plot now shows a clear depiction of the
 # largest and smallest Orange Tree Types on average!
 
 # I'll reshape into a wider data frame
 # that may be better for viewing and easier
 # to compute further summary metrics on!
 #?pivot_wider
 
 df1_wide <- pivot_wider(df1, names_from = "Tree",
                         values_from = "Circumference")
 
 View(df1_wide)
 df1_wide |> datatable(options = list(pageLength = 7))

Summarize 3 largest trees.

df1_wide |> 
   select(`Tree D`, `Tree B`, `Tree E`) |>
   summary()
##      Tree D          Tree B          Tree E     
##  Min.   : 32.0   Min.   : 33.0   Min.   : 30.0  
##  1st Qu.: 87.0   1st Qu.: 90.0   1st Qu.: 65.0  
##  Median :167.0   Median :156.0   Median :125.0  
##  Mean   :139.3   Mean   :135.3   Mean   :111.1  
##  3rd Qu.:194.0   3rd Qu.:187.5   3rd Qu.:158.0  
##  Max.   :214.0   Max.   :203.0   Max.   :177.0
 # Tree E had a min Circumference size of
 # 30, while Tree B had the largest min
 # size of 33.
 # Still, Tree D has the largest max, median
 # and overall average circumference.

Visualize size insights.

vio_mean <- ggplot(df1, aes(x = reorder(Tree, -Circumference), y = Circumference )) +
   geom_violin(trim=F, size=1, fill='white', color="black") +
   stat_summary(fun = "mean",geom = "crossbar", width = .35,
                color = "cyan3") +
   labs( x = "Tree Type", y = "Average Circumference", title = "Average Circumference by Tree Type")+
   theme_light()
 
vio_mean

  # Violin chart shows a really close comparison of averages
  # between Tree D and Tree B!
  # I'll plot this on a bar chart to compare further.
Avg_Circ_Bar <- 
   ggplot(df1, aes(x = reorder(Tree, -Circumference),
                   y = Circumference)) +
   geom_bar(fill = "lightgrey", stat = "summary", fun = "mean") +
   labs( x = "Tree Type", y = "Average Circumference") +
   theme(plot.title = element_text( hjust = 0.5, size = 15)) +
   theme( axis.title.x = element_text(color = "black", size = 11),
          axis.title.y = element_text(size = 11),
          panel.background = element_rect
          (fill = "white", colour = "lightgray", size = .5, linetype = "solid"))

 Avg_Circ_Bar 

 # The bar chart gives a much better contrast between
 # the average of Tree D and Tree B. However, it's still
 # a close call when looking at the visualizations.
 # I'll use an interactive chart to bring out these
 # distinct average values considering how close they still seem!
plotly::ggplotly(Avg_Circ_Bar)
 # Now I have an interactive chart showing actual
 # values to avoid confusion!
 
 
 # Lastly, I want to run some statistical tests to 
 # see if the differences in means I'm seeing here
 # are in fact significant. These seem 
 # extremely close, so my guess is that they're not.
 # I'll let the statistics confirm this for me!
 # An ANOVA test will compare means across all tree
 # groups at once!

Run ANOVA and T-Tests

# T-Test and ANOVA Assumptions
 # Data must:

 # 1.) Be Normally Distributed - Has been met.
 par(mfrow=c(2,2))
 plot(density(df1_wide$'Tree A'))
 plot(density(df1_wide$'Tree B'))
 plot(density(df1_wide$'Tree C'))
 plot(density(df1_wide$'Tree D'))

 par(mfrow=c(1,1))
 plot(density(df1_wide$'Tree E'))

# 2.) Have similar or Equal Variance - Has been met.
var.test(df1_wide$`Tree C`, df1_wide$`Tree D`, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  df1_wide$`Tree C` and df1_wide$`Tree D`
## F = 0.35737, num df = 6, denom df = 6, p-value = 0.236
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.06140644 2.07980947
## sample estimates:
## ratio of variances 
##          0.3573705
 # Instead of testing for equal variance
 # 1 by 1, I can test across the entire group of trees all
 # at once using Bartlett or Levene tests!
#?bartlett.test()
#?leveneTest()
# Check data structure for variance testing compatibility.
View(df1)
df1 |> datatable(options = list(pageLength = 9))
str(df1)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   35 obs. of  3 variables:
##  $ Tree         : Factor w/ 5 levels "Tree A","Tree B",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Age          : int  118 484 664 1004 1231 1372 1582 118 484 664 ...
##  $ Circumference: num  30 58 87 115 120 142 145 33 69 111 ...
df1 <- as.data.frame(df1)
 # Run tests
bartlett.test(Circumference ~ Tree, data = df1)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Circumference by Tree
## Bartlett's K-squared = 2.4607, df = 4, p-value = 0.6517
 # Although the data appear normal (so Bartlett's test should be
 # valid), I'll run a Levene test to double-check the p-value.
car::leveneTest(Circumference ~ Tree, data = df1)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  4  0.5907  0.672
##       30
 # The returned p-values of 0.6517 and 0.672 aren't below the
 # 5% alpha level. This means there's no evidence to
 # suggest the variance in Tree Circumference differs
 # significantly across the five Tree groups.
              
 # 3.) Be Independently Sampled - Has been met.
 # 4.) Be Randomly Sampled - Has been met.
 # 5.) Be Continuous - Has been met.
 
# All T-Test and ANOVA Assumptions have been met!

Run Tests

 # Run Tests to see the true statistical
 # significance in the differences in means
 # among the 3 largest trees.
 
 # recall aov() arguments
 #?aov
 # subset data for analysis of variance tests.
aovsubset <- df1 |> filter(Tree %in% c('Tree D', 'Tree B', 'Tree E')) |>
   select(Tree, Circumference)
 # view subset.
View(aovsubset)
aovsubset |> datatable(options = list(pageLength = 7))
 # Run aov test for 3 largest trees.
top.3.aov <- aov(formula = aovsubset$Circumference ~ aovsubset$Tree, data = aovsubset ) |>
   TukeyHSD(conf.level = 0.95)
 
top.3.aov
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = aovsubset$Circumference ~ aovsubset$Tree, data = aovsubset)
## 
## $`aovsubset$Tree`
##                    diff        lwr      upr     p adj
## Tree D-Tree B   4.00000  -85.91448 93.91448 0.9929212
## Tree E-Tree B -24.14286 -114.05734 65.77162 0.7749273
## Tree E-Tree D -28.14286 -118.05734 61.77162 0.7084884
 # Run an ANOVA test to see the true statistical
 # significance of the differences in means
 # among all the trees.

 # Make 2nd subset data for analysis of variance tests.
aovsubset2 <- df1 |> select(Tree, Circumference)
 # View subset.   
View(aovsubset2)
aovsubset2 |> datatable(options = list(pageLength = 7))
 # Run aov test for all trees.   
aov.model <- aov(formula = aovsubset2$Circumference ~ aovsubset2$Tree, data = aovsubset ) |>
   TukeyHSD(conf.level = 0.95) 
 
aov.model
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = aovsubset2$Circumference ~ aovsubset2$Tree, data = aovsubset)
## 
## $`aovsubset2$Tree`
##                     diff        lwr       upr     p adj
## Tree B-Tree A  35.714286  -54.03528 125.46385 0.7765641
## Tree C-Tree A  -5.571429  -95.32099  84.17813 0.9997503
## Tree D-Tree A  39.714286  -50.03528 129.46385 0.7030223
## Tree E-Tree A  11.571429  -78.17813 101.32099 0.9956073
## Tree C-Tree B -41.285714 -131.03528  48.46385 0.6725728
## Tree D-Tree B   4.000000  -85.74956  93.74956 0.9999331
## Tree E-Tree B -24.142857 -113.89242  65.60671 0.9343443
## Tree D-Tree C  45.285714  -44.46385 135.03528 0.5930257
## Tree E-Tree C  17.142857  -72.60671 106.89242 0.9805850
## Tree E-Tree D -28.142857 -117.89242  61.60671 0.8909860
 # The 95% confidence interval includes 0 for every comparison,
 # meaning no difference at all is plausible.
 # This is consistent with the high adjusted p-values returned by
 # Tukey's Honest Significant Difference!
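For completeness (a sketch; output not shown here), the omnibus ANOVA F-test across all five trees could be checked before the pairwise Tukey comparisons:

 # Overall test of whether any tree's mean circumference differs.
summary(aov(Circumference ~ Tree, data = aovsubset2))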

 # I'll run a quick t-test to see if the largest
 # and smallest trees are significantly different.
t.test(df1_wide$`Tree C`, df1_wide$`Tree D`, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  df1_wide$`Tree C` and df1_wide$`Tree D`
## t = -1.4304, df = 12, p-value = 0.1781
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -114.2673   23.6959
## sample estimates:
## mean of x mean of y 
##   94.0000  139.2857
 # With a p-value of 0.1781, if H0 were true,
 # a difference in sample means this large could occur by chance
 # roughly 18% of the time. This is well beyond
 # my alpha threshold of 5% and therefore is
 # not compelling evidence that the difference in means
 # is statistically significant.
 
 # The ANOVA tests showed that the differences 
 # in mean tree size were not significant.
 # This means the observed differences could well have arisen
 # by chance, and that in a different sample from
 # the same population we might see
 # results that support the null hypothesis that
 # there's no difference in means.
 
 # Due to the lack of statistical significance, I cannot
 # make inferences about the Population which
 # generated these samples.

Conclusion

In conclusion, the differences in mean Circumference of each tree group are not statistically significant. As a result, I have useful insights within the samples, yet I can’t use this analysis to make inferences about the actual populations these samples were drawn from. However, I can still acknowledge the validity of these findings in the context of these particular samples.

Findings and Analysis

I found that tree age has a strong relationship with tree circumference: the older a tree is, the larger it tends to be.

Tree D had the largest circumference on average. The 3 largest trees on average are Tree D, B and E. The smallest tree on average is Tree C. The differences in means were not statistically significant, so if new samples were taken we might see different results, and these findings could be by chance.

From my analysis, I’d conclude that trees continue to grow as they age, since the data showed a steady climb in size among all trees. As the trees grow in age, so too does the size of the tree by circumference. This sample shows no evidence of that process reversing or declining; the steady progression of growth with age continues throughout.

Answer Questions

 # Question 1
 # Does the age of a tree have a correlation to the
 # overall size of a tree? 
 # ANSWER: Yes
 
 # Question 2
 # Which tree had the largest circumference
 # on average?
 # ANSWER: Tree D
 
 # Question 3
 # Is the difference in tree Circumference means
 # we're seeing statistically significant?
 # ANSWER: No.
 
 # Question 4
 # Do trees continue to get larger circumferences
 # as they age?
 # ANSWER: Yes. There is a strong correlation between
 # tree age and continued growth of tree circumference.
 
 # Question 5
 # Does this phenomenon reverse at any particular time
 # in the tree's lifespan?
 # ANSWER: No. The line charts show no evidence of decline in
 # circumference growth as age progresses.
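As a quick numeric check on Question 5 (a sketch, not part of the original write-up), the growth increments between consecutive measurements can be inspected per tree; any negative increment would indicate a decline:

 # TRUE for a tree would mean its circumference shrank between measurements.
df1 |>
  group_by(Tree) |>
  arrange(Age, .by_group = TRUE) |>
  summarise(any_decline = any(diff(Circumference) < 0))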

Closing Thoughts

The ANOVA test dampened things toward the end, but I still got all the questions answered! I’ll need further analysis and more data to definitively make inferences about the population.

This was a minuscule data set, yet it was unique in the sense that I haven’t noticed anyone else working with it online. This probably wasn’t the best data set to use, but I stuck with it because I can reuse the source code for reproducibility in future projects.

Nonetheless, I had a blast challenging myself to find solutions when obstacles popped up. I managed to keep my composure and reason through what otherwise could have been disruptive obstacles to the analysis process. I really enjoyed this exercise!

This has been my statistical analysis on Orange Tree Growth!

Hope you enjoyed!

Thanks for viewing!