Project 1 Report

ENVX2001 Applied Statistical Methods

Published

March 23, 2025

My SID is: 540701410

Introduction

Jones et al. (2021) conducted a systematic study of the effects of diversified farming on biodiversity and yield, comparing diversified farming systems with simplified farming systems using a global database. Their research assessed the trade-offs of farm diversification across 48 countries, covering a range of production systems and geographical contexts. The study aimed to determine the extent to which diversified farming contributes to both biodiversity and agricultural productivity.

This report is based on a simulated dataset designed to reflect key trends and analytical approaches in the Jones et al. (2021) study. The primary objective is to evaluate how farming system type, fertiliser application and species abundance relate to crop yield. The analysis provides insight into how different farming practices (system and fertiliser use) could affect agricultural productivity.

The dataset consists of 42 observations of 4 variables: farming system, fertiliser use, abundance and crop yield. Each row represents an individual farm, with the farming system recorded as monoculture or diversified, fertiliser application recorded as “yes” or “no”, crop yield recorded as a number, and abundance recorded as an integer count. The dataset therefore contains two categorical variables (System and Fertiliser) and two numeric variables (Yield and Abundance). Since no units of measurement were provided, this report assumes that Yield is in kg/ha and that Abundance is a count of observed species.

Initial exploration using str(crop) and summary(crop) confirmed that all four variables were present with no missing values (see the two outputs below). The two categorical variables were converted with the mutate function so that R stores them as factors coded 1 or 2: System (1 = diversified, 2 = monoculture) and Fertiliser (1 = no, 2 = yes). This coding makes them usable as predictors in the regression models fitted in the subsequent analyses. The data summaries and discussion that follow explore the potential impacts of farming system and fertiliser use on agricultural productivity.

tibble [42 × 4] (S3: tbl_df/tbl/data.frame)
 $ System    : Factor w/ 2 levels "diversified",..: 1 2 1 1 1 2 2 2 1 1 ...
 $ Fertiliser: Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 2 2 2 ...
 $ Yield     : num [1:42] 7709 4120 5545 6989 7249 ...
 $ Abundance : int [1:42] 1 8 0 2 4 7 13 12 2 3 ...

Output of str(crop), showing the type of each variable and a preview of the data

         System   Fertiliser     Yield         Abundance     
 diversified:19   no :24     Min.   : 3119   Min.   : 0.000  
 monoculture:23   yes:18     1st Qu.: 5001   1st Qu.: 3.000  
                             Median : 6166   Median : 6.500  
                             Mean   : 6374   Mean   : 6.333  
                             3rd Qu.: 7696   3rd Qu.:10.000  
                             Max.   :10414   Max.   :16.000

Output of summary(crop), providing summary statistics for the numerical and categorical variables
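The checks described above can be sketched as follows. This is a minimal illustration only: `crop_demo` is a small invented stand-in for the project data, and only the column names are taken from the str() output.

```r
library(dplyr)

# Hypothetical stand-in for the project data (all values are invented)
crop_demo <- tibble(
  System     = c("diversified", "monoculture", "diversified", "monoculture"),
  Fertiliser = c("no", "no", "yes", "yes"),
  Yield      = c(7700, 4100, 8300, 5600),
  Abundance  = c(1L, 8L, 2L, 12L)
)

# Convert the two categorical columns to factors
crop_demo <- crop_demo |>
  mutate(System = factor(System), Fertiliser = factor(Fertiliser))

str(crop_demo)                      # variable types and a preview of the values
summary(crop_demo)                  # counts for factors, quantiles for numerics
n_missing <- sum(is.na(crop_demo))  # total number of missing values

# Integer codes behind the factor levels (levels are sorted alphabetically,
# so 1 = diversified and 2 = monoculture)
as.integer(crop_demo$System)
```

Keeping the columns as factors, rather than overwriting them with the numbers 1 and 2, lets lm() treat them as categories instead of as numeric predictors.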

Data summary

Response and Predictors

Response

Yield (kg/ha): the response variable, representing crop productivity, which is expected to be influenced by the predictors

Predictors

Farming system: assessing if the type of farming system (diversified or monoculture) has an effect on yield
Fertiliser use: assessing if fertiliser application affects yield
Species abundance: assessing if increased species abundance affects yield

Graphical & Numerical summaries

Table 1: Mean, median and standard deviation (SD) of Yield (kg/ha) for fertilised and unfertilised monoculture and diversified farms.

# A tibble: 4 × 5
# Groups:   System, Fertiliser [4]
  System      Fertiliser Mean_Yield Median_Yield SD_Yield
  <fct>       <fct>           <dbl>        <dbl>    <dbl>
1 diversified no              7561.        7659.    1124.
2 diversified yes             8313.        8200.    1069.
3 monoculture no              4779.        4498.     695.
4 monoculture yes             5591.        5780.    1242.
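A table of this shape can be produced with dplyr's group_by() and summarise(). The sketch below runs on invented data (`crop_demo` is hypothetical), so its numbers will not match Table 1; the use of `.groups = "drop"` is a choice made here, not necessarily what was used for the report.

```r
library(dplyr)

set.seed(42)  # reproducible invented data
crop_demo <- tibble(
  System     = factor(rep(c("diversified", "monoculture"), each = 10)),
  Fertiliser = factor(rep(c("no", "yes"), times = 10)),
  Yield      = rnorm(20, mean = 6000, sd = 1000)
)

# One row per System x Fertiliser combination
yield_summary <- crop_demo |>
  group_by(System, Fertiliser) |>
  summarise(
    Mean_Yield   = mean(Yield),
    Median_Yield = median(Yield),
    SD_Yield     = sd(Yield),
    .groups = "drop"   # return an ungrouped tibble
  )
yield_summary
```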

Summary Observations

Diversified farms:

They had a higher mean yield than monoculture farms, both with and without fertiliser.
On average, yield increased by 752.5 kg/ha when fertiliser was used.
The standard deviation was slightly lower in the fertilised farms than in the unfertilised farms (1069 vs 1124 kg/ha), indicating slightly more consistent yields in the fertilised farms.

Monoculture farms:

Fertilised farms had a higher mean yield than unfertilised farms.
On average, yield increased by 811.96 kg/ha when fertiliser was used.
The standard deviation was higher in the fertilised farms than in the unfertilised farms (1242 vs 695 kg/ha), showing that yield was more variable in the fertilised farms.

Overall: Fertilised farms had a higher mean yield than unfertilised farms in both systems (Table 1), and the size of the increase was similar (about 752 kg/ha in diversified farms and 812 kg/ha in monoculture farms). Diversified farms had higher yields, with similar variability whether or not fertiliser was used (SD 1124 vs 1069 kg/ha), whereas the variability of monoculture yields nearly doubled when fertiliser was applied (SD 695 vs 1242 kg/ha).
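The differences quoted in these observations can be recomputed from the rounded means printed in Table 1. The sketch below only re-enters those printed values; the object names are illustrative, and small discrepancies from the text (for example 812 vs 811.96) come from rounding.

```r
# Group means copied from Table 1 (rounded as printed)
mean_yield <- matrix(
  c(7561, 8313,
    4779, 5591),
  nrow = 2, byrow = TRUE,
  dimnames = list(System = c("diversified", "monoculture"),
                  Fertiliser = c("no", "yes"))
)

# Gain from fertiliser within each system
fert_gain <- mean_yield[, "yes"] - mean_yield[, "no"]
fert_gain    # diversified 752, monoculture 812

# Gap between systems, without and with fertiliser
system_gap <- mean_yield["diversified", ] - mean_yield["monoculture", ]
system_gap   # no 2782, yes 2722
```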

Figure 1: Boxplot of yield for diversified and monoculture farms, with and without fertiliser.

Boxplot observations

The highest median yield is observed for diversified farms with fertiliser; however, this group also has the widest IQR, indicating more variability in yield. In contrast, monoculture farms without fertiliser have the lowest median yield and the narrowest IQR, suggesting more consistent, although lower, yields.
Even without fertiliser, diversified farming has a higher median yield than monoculture. The gap between the two systems remains large when fertiliser is applied (a difference in median yield of about 2420 kg/ha with fertiliser, compared with about 3161 kg/ha without; Table 1), supporting the notion that diversification may enhance productivity beyond the sole effect of fertilisation.
Diversified farming systems outperform monoculture systems in yield both with and without fertilisation. Fertilisation is also associated with higher yield in both systems, with an increase in median yield of about 541 kg/ha in diversified farms and about 1282 kg/ha in monoculture farms (Table 1); whether these differences are statistically significant needs a formal test.
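A grouped boxplot like the one described here can be drawn with ggplot2, and the medians and IQRs behind the boxes can be checked numerically. This is a sketch under assumptions: `crop_demo` is simulated with invented effect sizes, and the aesthetics are one possible layout, not necessarily the one used for Figure 1.

```r
library(ggplot2)

set.seed(1)  # reproducible invented data
crop_demo <- data.frame(
  System     = factor(rep(c("diversified", "monoculture"), each = 20)),
  Fertiliser = factor(rep(c("no", "yes"), times = 20))
)
crop_demo$Yield <- 7500 - 2500 * (crop_demo$System == "monoculture") +
  800 * (crop_demo$Fertiliser == "yes") + rnorm(40, sd = 1000)

# One box per System x Fertiliser combination
p_box <- ggplot(crop_demo, aes(x = System, y = Yield, fill = Fertiliser)) +
  geom_boxplot() +
  theme_minimal() +
  labs(x = "Farming system", y = "Yield (kg/ha)", fill = "Fertiliser applied")
p_box

# Median and IQR of each group, to back up what is read off the plot
box_stats <- aggregate(Yield ~ System + Fertiliser, data = crop_demo,
                       FUN = function(y) c(median = median(y), IQR = IQR(y)))
box_stats
```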

Figure 2: Scatter plot showing the relationship between biodiversity (species abundance) and yield.

The correlation between Yield and Abundance is -0.6946344

Correlation observations

There is a moderate negative correlation between abundance and yield (r = -0.69; Figure 2). As biodiversity (abundance) increases, yield tends to decrease, and this pattern appears in both monoculture and diversified farms (Figure 2).
Because the relationship is only moderate in strength, farms in this dataset with higher biodiversity tend to have lower crop yields, but with considerable scatter around the trend.
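The correlation can be computed with cor(), and for a simple linear regression its square equals the model's R-squared. The sketch below demonstrates this on simulated data (the vectors `abundance` and `yield` are invented) and then applies the same identity to the coefficient reported above.

```r
set.seed(7)  # reproducible invented data
abundance <- rpois(42, lambda = 6)
yield <- 8000 - 280 * abundance + rnorm(42, sd = 1200)

r  <- cor(yield, abundance)        # Pearson correlation coefficient
ct <- cor.test(yield, abundance)   # adds a t-test and a 95% confidence interval
ct$conf.int

# For a one-predictor regression, r squared equals R-squared
fit <- lm(yield ~ abundance)
r_squared <- summary(fit)$r.squared
all.equal(r^2, r_squared)

# Applied to the value reported in this report
reported_r2 <- (-0.6946344)^2
round(reported_r2, 4)   # 0.4825, the Multiple R-squared in the Appendix
```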

Discussion

This dataset contains both categorical and numerical variables, which influences the choice of statistical methods. Approaches such as ANOVA, t-tests or regression models are suitable for analysing yield across farming systems and fertiliser treatments. The spread of yield in the boxplot varied between groups, so the assumption of homogeneity of variance needs to be checked before applying a parametric test, and skewness or outliers should be assessed because they can affect tests that assume normality.
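As an illustration of the ANOVA option, a two-way model with an interaction term can be fitted with aov(). The data below are simulated and balanced (`crop_demo` and its effect sizes are invented); the real dataset is unbalanced (19 diversified and 23 monoculture farms), in which case the sequential sums of squares reported by summary() depend on the order of the terms.

```r
set.seed(1)  # reproducible invented data
crop_demo <- data.frame(
  System     = factor(rep(c("diversified", "monoculture"), each = 20)),
  Fertiliser = factor(rep(c("no", "yes"), times = 20))
)
crop_demo$Yield <- 7500 - 2500 * (crop_demo$System == "monoculture") +
  800 * (crop_demo$Fertiliser == "yes") + rnorm(40, sd = 1000)

# Main effects of System and Fertiliser plus their interaction
fit_aov <- aov(Yield ~ System * Fertiliser, data = crop_demo)
aov_table <- summary(fit_aov)[[1]]
aov_table
```

The interaction row tests whether the effect of fertiliser differs between the two farming systems.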

The negative correlation between abundance and yield suggests a possible trade-off between biodiversity and agricultural productivity, which could be explored further. Rather than relying solely on correlation, a multiple regression model incorporating abundance, farming system and fertiliser use can provide deeper insight into how biodiversity relates to yield. In the multiple regression in the Appendix, the abundance coefficient is small and not significant (-40.92, p = 0.608) once system and fertiliser are included, which suggests that much of the raw correlation may reflect differences between farming systems. Correlation also does not imply causation: additional factors such as soil quality, location and management practices could be influencing this relationship.
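The difference between a raw correlation and an adjusted regression coefficient can be seen in a small simulation. Everything here is invented for illustration: the assumption that one system has higher abundance counts, and that yield depends on system only, is built into the simulated data and is not a finding of this report.

```r
set.seed(3)  # reproducible invented data
n <- 42
monoculture <- rep(c(0, 1), each = n / 2)

# Invented assumption: abundance counts differ strongly between systems
abundance <- rpois(n, lambda = ifelse(monoculture == 1, 10, 2))

# Yield is simulated to depend on system only, not on abundance
yield <- 7500 - 2500 * monoculture + rnorm(n, sd = 1000)

# Slope for abundance without and with adjustment for system
raw_slope <- coef(lm(yield ~ abundance))[["abundance"]]
adj_slope <- coef(lm(yield ~ abundance + monoculture))[["abundance"]]
c(raw = raw_slope, adjusted = adj_slope)
```

In this setup the raw slope is strongly negative even though abundance has no direct effect, while the adjusted slope shrinks towards zero.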

The normality assumption could also be violated in this dataset. Yield may not be normally distributed, so a Shapiro-Wilk test or Q-Q plots could be used to assess normality (for ANOVA and regression, the assumption applies to the model residuals). If the assumption is violated, a log or square-root transformation may be needed.
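A minimal sketch of these normality checks, applied to model residuals, is shown below. The data are simulated (`crop_demo` is invented), and the model formula is only one example of a model whose residuals might be checked.

```r
set.seed(1)  # reproducible invented data
crop_demo <- data.frame(
  System     = factor(rep(c("diversified", "monoculture"), each = 20)),
  Fertiliser = factor(rep(c("no", "yes"), times = 20))
)
crop_demo$Yield <- 7500 - 2500 * (crop_demo$System == "monoculture") +
  800 * (crop_demo$Fertiliser == "yes") + rnorm(40, sd = 1000)

fit <- lm(Yield ~ System + Fertiliser, data = crop_demo)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(res)
sw$p.value

# Q-Q plot: points should fall close to the reference line
qqnorm(res)
qqline(res)

# If the residuals look skewed, refit on a transformed response and re-check
fit_log <- lm(log(Yield) ~ System + Fertiliser, data = crop_demo)
sw_log <- shapiro.test(residuals(fit_log))
sw_log$p.value
```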

The standard deviations indicate different levels of variability between groups (Table 1), which can affect statistical analyses that assume equal variance across groups. Levene’s test can determine whether the variances differ significantly between groups.
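Levene's test can be carried out in base R as a one-way ANOVA on the absolute deviations of each observation from its group centre. The sketch below uses group medians (the Brown-Forsythe variant) on simulated data; `crop_demo` and the helper columns `Group` and `AbsDev` are invented names. The car package provides leveneTest() for the same purpose if that package is available.

```r
set.seed(1)  # reproducible invented data
crop_demo <- data.frame(
  System     = factor(rep(c("diversified", "monoculture"), each = 20)),
  Fertiliser = factor(rep(c("no", "yes"), times = 20))
)
crop_demo$Yield <- 7500 - 2500 * (crop_demo$System == "monoculture") +
  800 * (crop_demo$Fertiliser == "yes") + rnorm(40, sd = 1000)

# One grouping factor for the four System x Fertiliser combinations
crop_demo$Group <- interaction(crop_demo$System, crop_demo$Fertiliser)

# Absolute deviation of each yield from its own group median
group_median <- ave(crop_demo$Yield, crop_demo$Group, FUN = median)
crop_demo$AbsDev <- abs(crop_demo$Yield - group_median)

# ANOVA on the absolute deviations; a small p-value suggests unequal variances
levene_fit <- anova(lm(AbsDev ~ Group, data = crop_demo))
levene_p <- levene_fit[["Pr(>F)"]][1]
levene_fit
```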

This dataset requires careful consideration of normality, variance homogeneity and outliers before further statistical tests are applied. By using data transformations and robust statistical approaches, the relationships between farming system, fertilisation, abundance and yield can be analysed more effectively, leading to valid and reliable conclusions about the potential impacts of farming system and fertiliser use on agricultural productivity.

Acknowledgements & statement of originality

This report was written in RStudio and Quarto. I started by reading the data paper by Jones et al. (2021) to understand the context of the data. I then loaded the data into R to see what the variables were and what the values looked like. Next, I used the summary and structure functions to examine the data, and then created as many plots as I could. ChatGPT was used for assistance in combining multiple groups into one boxplot and in coding the scatter plot. Once the visualisation was complete, I went back and wrote the introduction, knowing what my basis was and understanding how my data compared with my preconceived ideas. Finally, the discussion was written to summarise the data and its implications.

Other resources that were used to complete this report:

ChatGPT - was used to help understand boxplots and scatter plots. It was also used to reword my introduction in a more scientific style.
ENVX1001 - I went back to my older projects to work out what code I could transfer over, and to help structure the order of the data outputs and exploration.
Google - was used to research different measures of biodiversity and work out the best unit of measurement for this project. I concluded that abundance would be best, as it seemed to be the most workable value with my data.

References

OpenAI. (2023). ChatGPT (Mar 14 version)

22.2: Diversity Indices 2022, Biology LibreTexts, viewed 18 March 2025


Appendix

Code

# Simple Linear Regression Model
model <- lm(Yield ~ Abundance, data = crop)
summary(model)


Call:
lm(formula = Yield ~ Abundance, data = crop)

Residuals:
     Min       1Q   Median       3Q      Max 
-3067.90  -668.73   -94.42   662.96  3103.31 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8152.60     351.15  23.217  < 2e-16 ***
Abundance    -280.78      45.98  -6.107 3.34e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1272 on 40 degrees of freedom
Multiple R-squared:  0.4825,    Adjusted R-squared:  0.4696 
F-statistic:  37.3 on 1 and 40 DF,  p-value: 3.341e-07
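The t value and p-value for Abundance in the output above follow directly from the estimate and its standard error, and the same quantities give a 95% confidence interval. The sketch re-enters the printed (rounded) values, so results agree with the output only up to rounding; the variable names are illustrative.

```r
# Values copied from the coefficient table above
est <- -280.78   # slope for Abundance
se  <- 45.98     # its standard error
df  <- 40        # residual degrees of freedom

t_value <- est / se                    # about -6.107
p_value <- 2 * pt(-abs(t_value), df)   # two-sided p-value, close to 3.34e-07

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci   # roughly -374 to -188 per additional unit of abundance
```

Because the interval excludes zero, it agrees with the significant p-value for this one-predictor model.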

Code

# Simple linear regression of Yield on Fertiliser (reference level: "no")
model <- lm(Yield ~ Fertiliser, data = crop)
summary(model)


Call:
lm(formula = Yield ~ Fertiliser, data = crop)

Residuals:
    Min      1Q  Median      3Q     Max 
-3681.9 -1444.1  -270.5  1274.4  3612.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6054.2      352.5  17.173   <2e-16 ***
Fertiliseryes    746.9      538.5   1.387    0.173    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1727 on 40 degrees of freedom
Multiple R-squared:  0.04588,   Adjusted R-squared:  0.02203 
F-statistic: 1.924 on 1 and 40 DF,  p-value: 0.1731
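With a two-level factor as the only predictor, the intercept is the mean of the reference group ("no") and the slope is the difference between the group means. The sketch below re-enters the printed coefficients and the group sizes from the summary(crop) output to recover the two group means and the overall mean; the variable names are illustrative.

```r
# Coefficients copied from the output above
b0 <- 6054.2   # intercept: mean yield of unfertilised farms
b1 <- 746.9    # Fertiliseryes: fertilised minus unfertilised

# Group sizes from the summary(crop) output
n_no  <- 24
n_yes <- 18

mean_no  <- b0
mean_yes <- b0 + b1   # 6801.1

# The weighted average of the group means gives the overall mean yield
overall_mean <- (n_no * mean_no + n_yes * mean_yes) / (n_no + n_yes)
round(overall_mean, 1)   # 6374.3, matching the mean of 6374 in summary(crop)
```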

Code

# Fit the multiple linear regression model (one response, three predictors)
model_multi <- lm(Yield ~ Fertiliser + System + Abundance, data = crop)

# Display model summary
summary(model_multi)


Call:
lm(formula = Yield ~ Fertiliser + System + Abundance, data = crop)

Residuals:
     Min       1Q   Median       3Q      Max 
-2589.20  -544.59   -26.56   496.37  2093.17 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7622.47     308.14  24.737  < 2e-16 ***
Fertiliseryes       820.67     327.33   2.507  0.01656 *  
Systemmonoculture -2448.31     675.11  -3.627  0.00084 ***
Abundance           -40.92      79.21  -0.517  0.60848    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1026 on 38 degrees of freedom
Multiple R-squared:  0.6799,    Adjusted R-squared:  0.6547 
F-statistic: 26.91 on 3 and 38 DF,  p-value: 1.663e-09
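The coefficients above define a prediction equation for yield. The sketch below re-enters the printed estimates and evaluates the equation for two example farms; the function name `predict_yield` and the example farms are hypothetical.

```r
# Coefficients copied from the output above
coefs <- c(intercept = 7622.47, fert_yes = 820.67,
           monoculture = -2448.31, abundance = -40.92)

# fertilised and monoculture are 0/1 indicators; abundance is a count
predict_yield <- function(fertilised, monoculture, abundance) {
  unname(coefs["intercept"] +
           coefs["fert_yes"] * fertilised +
           coefs["monoculture"] * monoculture +
           coefs["abundance"] * abundance)
}

# Fertilised diversified farm with abundance 2:
# 7622.47 + 820.67 - 2 * 40.92 = 8361.30
pred_div <- predict_yield(fertilised = 1, monoculture = 0, abundance = 2)

# Unfertilised monoculture farm with abundance 10:
# 7622.47 - 2448.31 - 10 * 40.92 = 4764.96
pred_mono <- predict_yield(fertilised = 0, monoculture = 1, abundance = 10)

c(diversified = pred_div, monoculture = pred_mono)
```

With the fitted model object available, predict(model_multi, newdata = ...) would give the same values without rounding error.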

Code

# Load necessary library
library(ggplot2)

# Create histogram for Yield
ggplot(crop, aes(x = Yield)) +
  geom_histogram(binwidth = 500, fill = "lightblue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Yield",
       x = "Yield (kg/ha)",
       y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))

Code

# Load necessary library
library(ggplot2)

# Create side-by-side histogram for Yield by Farming System
ggplot(crop, aes(x = Yield, fill = System)) +
  geom_histogram(binwidth = 500, color = "black", alpha = 0.7) +
  theme_minimal() +
  facet_wrap(~ System) +  # Creates separate histograms for each system
  labs(title = "Distribution of Yield for Diversified and Monoculture Farms",
       x = "Yield (kg/ha)",
       y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))

Session information

Do not delete this section. We need this information for reproducibility and integrity checks.

=== Integrity Check Report ===

Time of execution: 2025-03-23 15:21:18

Last modified: 2025-03-23 15:21:17

File creation: 2025-03-23 15:21:17

Data hash: a1d8ce82f69bfc1110018ed0def8f9cf03c82d30594c03f112a025c07d97c2a9

File hash: 065e1a9322fcd2b4c269e3f432efda41f556c891ea5edd795d70c74b40f8795d

=== Environment Information ===

Working directory: /Users/sophiatweed/Library/CloudStorage/OneDrive-TheUniversityofSydney(Students)/Year 2/ENVX2001/ENVX2001-project1-template

User: sophiatweed

Home directory: /Users/sophiatweed

Language: en_US.UTF-8

=== R Session Information ===

R version: R version 4.4.2 (2024-10-31)

RStudio version: Not running in RStudio


=== System Information ===

Operating system: Darwin

OS version: Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:06 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T8103

Machine type: arm64

Node name: Sophias-Air-2.modem

=== Loaded Packages ===

            Package Version Attached
digest       digest  0.6.37      Yes
lubridate lubridate   1.9.4      Yes
forcats     forcats   1.0.0      Yes
stringr     stringr   1.5.1      Yes
dplyr         dplyr   1.1.4      Yes
purrr         purrr   1.0.4      Yes
readr         readr   2.1.5      Yes
tidyr         tidyr   1.3.1      Yes
tibble       tibble   3.2.1      Yes
ggplot2     ggplot2   3.5.1      Yes
tidyverse tidyverse   2.0.0      Yes