My SID is: 540701410
Project 1 Report
ENVX2001 Applied Statistical Methods
Introduction
Jones et al. (2021) conducted a systematic study of the effects of diversified farming on biodiversity and yield, comparing diversified farming systems with simplified farming systems using a global database. Their research assessed the trade-offs of farm diversification across 48 countries, incorporating a range of production systems and geographical contexts. The study aimed to determine the extent to which diversified farming contributes to both biodiversity and agricultural productivity.
This report is based on a simulated dataset designed to reflect key trends and analytical approaches in the Jones et al. (2021) study. The primary objective is to evaluate how farming system type, fertiliser application and species abundance relate to crop yield using statistical methods. The analysis provides insight into how different farming practices (farming system and fertiliser use) may affect agricultural productivity.
The dataset consists of 42 observations and 4 variables: farming system, fertiliser use, abundance and crop yield. Each row represents an individual farm, with a farming system categorised as monoculture or diversified, a binary “yes” or “no” for whether fertiliser was applied, a recorded crop yield and a numerical abundance value. The dataset therefore contains two categorical variables (System and Fertiliser) and two numeric variables (Yield and Abundance). Since no units of measurement were provided, this report assumes that Yield is measured in kg/ha and that Abundance is a count of observed species.
Initial exploration using str(crop) and summary(crop) confirmed that all variables were present with no missing values (Figure 1 & 2). Because there were two categorical variables, the mutate function was used to recode them numerically: System (1 = diversified, 2 = monoculture) and Fertiliser (1 = no, 2 = yes). This transformation ensured compatibility with the statistical methods used in the subsequent analyses. The data summaries and discussion that follow are used to understand the potential impacts of farming system and fertiliser use on agricultural productivity.
tibble [42 × 4] (S3: tbl_df/tbl/data.frame)
$ System : Factor w/ 2 levels "diversified",..: 1 2 1 1 1 2 2 2 1 1 ...
$ Fertiliser: Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 2 2 2 ...
$ Yield : num [1:42] 7709 4120 5545 6989 7249 ...
$ Abundance : int [1:42] 1 8 0 2 4 7 13 12 2 3 ...
Output of str(crop), showing the type of each variable and a quick view of the data
        System    Fertiliser      Yield          Abundance
 diversified:19   no :24     Min.   : 3119   Min.   : 0.000
 monoculture:23   yes:18     1st Qu.: 5001   1st Qu.: 3.000
                             Median : 6166   Median : 6.500
                             Mean   : 6374   Mean   : 6.333
                             3rd Qu.: 7696   3rd Qu.:10.000
                             Max.   :10414   Max.   :16.000
Output of summary(crop), providing summary statistics for the numerical and categorical variables
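The numeric recoding described above can be sketched as follows. This is a minimal illustration on a small hypothetical data frame (`toy`, not the report's `crop` data), and it uses base R's `as.integer()` on the factors rather than `mutate()`. It relies on the level order shown in the `str()` output (diversified before monoculture, no before yes).

```r
# Hypothetical toy data with the same factor levels as the crop dataset
toy <- data.frame(
  System     = factor(c("diversified", "monoculture", "diversified", "monoculture")),
  Fertiliser = factor(c("no", "no", "yes", "yes"))
)

# Factors store integer codes in level order (alphabetical by default),
# so as.integer() gives diversified = 1, monoculture = 2 and no = 1, yes = 2
toy$System_num     <- as.integer(toy$System)
toy$Fertiliser_num <- as.integer(toy$Fertiliser)

print(toy)
```

Note that `lm()` accepts factors directly and builds the indicator variables itself, so this recoding is optional for regression.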
Data summary
Response and Predictors
Response
- Yield (kg/ha): This variable is influenced by the predictors and represents crop productivity
Predictors
- Farming system: assessing if the type of farming system (diversified or monoculture) has an effect on yield
- Fertiliser use: assessing if fertiliser application affects yield
- Species abundance: assessing if increased species abundance affects yield
Graphical & Numerical summaries
Table 1: Mean, median and standard deviation (SD) of Yield for fertilised and unfertilised monoculture and diversified farms.
# A tibble: 4 × 5
# Groups: System, Fertiliser [4]
System Fertiliser Mean_Yield Median_Yield SD_Yield
<fct> <fct> <dbl> <dbl> <dbl>
1 diversified no 7561. 7659. 1124.
2 diversified yes 8313. 8200. 1069.
3 monoculture no 4779. 4498. 695.
4 monoculture yes 5591. 5780. 1242.
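A grouped summary of this kind can be sketched in base R as below. The data frame `toy` is simulated purely for illustration (its values are made up and will not reproduce Table 1), and `aggregate()` is used here as a stand-in for whichever grouping functions produced the original table.

```r
# Simulated stand-in for the crop data (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  System     = factor(rep(c("diversified", "monoculture"), each = 10)),
  Fertiliser = factor(rep(c("no", "yes"), times = 10)),
  Yield      = round(rnorm(20, mean = 6000, sd = 1000))
)

# Mean, median and SD of Yield for each System x Fertiliser combination
group_stats <- aggregate(
  Yield ~ System + Fertiliser,
  data = toy,
  FUN  = function(y) c(Mean = mean(y), Median = median(y), SD = sd(y))
)

# aggregate() returns the three statistics as one matrix column;
# do.call(data.frame, ...) flattens it into separate columns
group_stats <- do.call(data.frame, group_stats)
print(group_stats)
```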
Summary Observations
Diversified farms:
- Mean yield was higher than in monoculture farms for both fertilised and unfertilised farms.
- On average, yield increased by 752.5 kg/ha when fertiliser was used.
- The standard deviation was slightly lower in fertilised farms than in unfertilised farms, indicating that yield was slightly more consistent when fertiliser was used.
Monoculture farms:
- Fertilised farms had a higher mean yield than unfertilised farms.
- On average, yield increased by 811.96 kg/ha when fertiliser was used.
- The standard deviation was higher in fertilised farms than in unfertilised farms, indicating that yield was more variable when fertiliser was used.
Overall: Fertilised farms had a higher mean yield than unfertilised farms in both systems (Table 1), and the mean increase was similar in each (752.5 kg/ha in diversified farms, 811.96 kg/ha in monoculture farms). Variability responded differently between systems: in diversified farms the SD was similar with and without fertiliser (1069 vs 1124), whereas in monoculture farms the SD was much higher with fertiliser (1242 vs 695).
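The fertiliser effects quoted above can be checked directly from Table 1. The sketch below re-enters the rounded means and SDs from the table by hand, so its differences (752 and 812 kg/ha) differ slightly from the unrounded figures reported above (752.5 and 811.96 kg/ha). The `SD_ratio` column is an added illustration, not part of the original analysis.

```r
# Group means and SDs copied from Table 1 (rounded to whole kg/ha)
table1 <- data.frame(
  System     = c("diversified", "diversified", "monoculture", "monoculture"),
  Fertiliser = c("no", "yes", "no", "yes"),
  Mean_Yield = c(7561, 8313, 4779, 5591),
  SD_Yield   = c(1124, 1069, 695, 1242)
)

fert   <- table1[table1$Fertiliser == "yes", ]
unfert <- table1[table1$Fertiliser == "no", ]

# Fertiliser effect within each system: fertilised minus unfertilised mean,
# plus the ratio of fertilised to unfertilised SD as a rough guide to spread
effect <- data.frame(
  System        = fert$System,
  Mean_increase = fert$Mean_Yield - unfert$Mean_Yield,
  SD_ratio      = round(fert$SD_Yield / unfert$SD_Yield, 2)
)
print(effect)
```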
Boxplot observations
The highest median yield is observed for diversified farms with fertilisation, although this group also has the widest IQR, indicating more variability in yields. In contrast, monoculture without fertilisation has the lowest median yield and the narrowest IQR, suggesting that its yields are low but consistent.
Even without fertiliser, diversified farming has a higher median yield than monoculture. The gap does not widen when fertiliser is applied: the difference in median yield is about 3160 kg/ha without fertiliser and about 2420 kg/ha with it (Table 1). This supports the notion that diversification may enhance productivity beyond the sole effect of fertilisation.
Diversified farming systems outperform monoculture systems in yield both with and without fertilisation. Fertilisation is also associated with higher yield, and the mean increase is similar in both systems (Table 1).
The correlation between Yield and Abundance is -0.6946344
Correlation observations
There is a moderate negative correlation between abundance and yield (Figure 2 and correlation coefficient). The negative correlation shows that as biodiversity (abundance) increases, yield tends to decrease, and this holds for both monoculture and diversified farms (Figure 2).
Because the strength of the relationship is moderate, farms in this dataset with higher biodiversity tend to have lower crop yields.
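The correlation coefficient ties directly to the simple regression of Yield on Abundance in the Appendix: in a regression with one predictor, R-squared equals the squared correlation. The sketch below checks this against the two reported values, then demonstrates the same identity on simulated data (the variables `abundance` and `yield` are made up for illustration and are not the report's data).

```r
# Values reported in this report: correlation and Multiple R-squared (Appendix)
r_reported  <- -0.6946344
r2_reported <- 0.4825

# Squaring the correlation recovers the reported R-squared (to 4 decimals)
print(round(r_reported^2, 4))

# Same identity on simulated data (made-up values, for illustration only)
set.seed(42)
abundance <- rpois(42, lambda = 6)
yield     <- 8000 - 280 * abundance + rnorm(42, sd = 1200)

r_sim  <- cor(yield, abundance)
r2_sim <- summary(lm(yield ~ abundance))$r.squared
print(c(r = r_sim, r_squared = r_sim^2, model_r2 = r2_sim))
```

In other words, abundance on its own accounts for roughly 48% of the variation in yield in this dataset.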
Discussion
This data set consists of categorical and numerical data, which influences the choice of statistical methods. Approaches such as ANOVA, t-tests or regression models would be suitable for analysing yield across farming systems and fertiliser treatments. The boxplots showed that the spread of yield varied between groups, so the assumption of homogeneity of variance needs to be checked before applying a parametric test, along with any skewness or outliers that could affect tests that require normality.
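One of the approaches mentioned above, a two-way ANOVA of yield on farming system and fertiliser use, could be set up as sketched below. The data are simulated with made-up effect sizes and a balanced design (8 farms per group), unlike the report's `crop` data, so the output only illustrates the structure of the analysis.

```r
# Simulated balanced design: 8 hypothetical farms per System x Fertiliser group
set.seed(2)
toy <- expand.grid(
  rep        = 1:8,
  System     = c("diversified", "monoculture"),
  Fertiliser = c("no", "yes")
)

# Made-up effects: lower yield in monoculture, higher yield with fertiliser
toy$Yield <- 7500 -
  2500 * (toy$System == "monoculture") +
  800  * (toy$Fertiliser == "yes") +
  rnorm(nrow(toy), sd = 1000)

# Two-way ANOVA; System * Fertiliser expands to both main effects
# plus their interaction (does the fertiliser effect differ by system?)
fit <- aov(Yield ~ System * Fertiliser, data = toy)
print(summary(fit))
```

The interaction row is the formal version of the question raised in the boxplot observations, namely whether the benefit of fertiliser depends on the farming system.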
The negative correlation between abundance and yield suggests a possible trade-off between biodiversity and agricultural productivity, which could be explored further. Rather than relying solely on correlation, multiple regression models incorporating abundance, farming system and fertiliser use can provide deeper insight into how biodiversity relates to yield. The multiple regression in the Appendix illustrates this: once farming system and fertiliser use are included, the abundance coefficient is no longer significant (p = 0.608). Correlation does not imply causation, and additional factors such as soil quality, location and management practices could be influencing this relationship.
The normality assumption could be violated in this data set. Yield may not be normally distributed, so a Shapiro-Wilk test or Q-Q plots could be used to assess normality. If the data are found to violate this assumption, a log or square-root transformation may be needed.
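The normality checks described above might look like the following sketch. It is run on simulated data (`toy`, with made-up group means) and applies the checks to the residuals of a fitted model, which is an assumption of this sketch: for ANOVA and regression it is the residuals, rather than raw yield, that are assumed to be normal.

```r
# Simulated yields for two hypothetical groups of 21 farms each
set.seed(3)
toy <- data.frame(
  System = factor(rep(c("diversified", "monoculture"), each = 21))
)
toy$Yield <- ifelse(toy$System == "diversified", 7900, 5100) +
  rnorm(42, sd = 1000)

fit <- lm(Yield ~ System, data = toy)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(residuals(fit))
print(sw)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(residuals(fit))
qqline(residuals(fit))

# If normality looked doubtful, the model could be refitted on the log scale
fit_log <- lm(log(Yield) ~ System, data = toy)
print(shapiro.test(residuals(fit_log))$p.value)
```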
The standard deviations indicate different levels of variability between groups, which can affect statistical analyses that assume equal variance across groups. Levene’s test can determine whether variances differ significantly across groups.
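A minimal base R sketch of this check is given below. It uses the median-centred form of Levene's test (often called the Brown-Forsythe test), computed as a one-way ANOVA on each observation's absolute deviation from its group median. The data and group labels are simulated for illustration, with SDs that only loosely echo Table 1. In practice `leveneTest()` from the car package is the more usual route.

```r
# Simulated yields for four hypothetical groups with unequal spread
set.seed(4)
toy <- data.frame(
  Group = factor(rep(c("div_no", "div_yes", "mono_no", "mono_yes"), each = 10))
)
group_sd  <- c(div_no = 1100, div_yes = 1100, mono_no = 700, mono_yes = 1200)
toy$Yield <- 6000 + rnorm(40, sd = group_sd[as.character(toy$Group)])

# Absolute deviation of each observation from its own group median
toy$abs_dev <- abs(toy$Yield - ave(toy$Yield, toy$Group, FUN = median))

# One-way ANOVA on the deviations: the F test in the Group row is the
# median-centred Levene test of equal variances
levene_tab <- anova(lm(abs_dev ~ Group, data = toy))
print(levene_tab)
```

A small p-value in the Group row would suggest that the group variances differ, in which case a transformation or a method that does not assume equal variances could be considered.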
This data set requires careful consideration of normality, variance homogeneity and outliers before applying further statistical tests. By using data transformations and robust statistical approaches the relationships between farming systems, fertilisation, abundance and yield can be analysed more effectively leading to valid and reliable conclusions about potential impacts of farming system and fertiliser usage on agricultural productivity.
Acknowledgements & statement of originality
This report was written in RStudio and Quarto. I started by reading the data paper by Jones et al. (2021) to understand the context of the data. I then loaded the data and opened it in R to see what the variables were and what the values looked like. After this I ran a summary and structure check on the data, and then tried to create as many plots as I could. ChatGPT was used for assistance with combining multiple groups into one boxplot and with coding the scatterplot. Once the visualisation was complete, I went back and wrote the introduction, knowing what my basis was and understanding how my data compared with my preconceived ideas. Finally, the discussion was written to summarise the data and its implications.
Other resources that were used to complete this report:
ChatGPT - used to help understand boxplots and scatterplots. It was also used to reword my introduction so that it read more like a scientific introduction.
ENVX1001 - I used this to go back over my older projects and work out what code I could transfer over, as well as how to structure the order of the data outputs and exploration.
Google - used to research different measures of biodiversity and work out the best unit of measurement for this project. I came to the conclusion that abundance would be best, as it seemed to be the most workable value with my data.
References
OpenAI. (2023). ChatGPT (Mar 14 version)
22.2: Diversity Indices 2022, Biology LibreTexts, viewed 18 March 2025
Appendix
# Simple linear regression model: Yield on Abundance
model <- lm(Yield ~ Abundance, data = crop)
summary(model)
Call:
lm(formula = Yield ~ Abundance, data = crop)
Residuals:
Min 1Q Median 3Q Max
-3067.90 -668.73 -94.42 662.96 3103.31
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8152.60 351.15 23.217 < 2e-16 ***
Abundance -280.78 45.98 -6.107 3.34e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1272 on 40 degrees of freedom
Multiple R-squared: 0.4825, Adjusted R-squared: 0.4696
F-statistic: 37.3 on 1 and 40 DF, p-value: 3.341e-07
# Simple linear regression model: Yield on Fertiliser
model <- lm(Yield ~ Fertiliser, data = crop)
summary(model)
Call:
lm(formula = Yield ~ Fertiliser, data = crop)
Residuals:
Min 1Q Median 3Q Max
-3681.9 -1444.1 -270.5 1274.4 3612.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6054.2 352.5 17.173 <2e-16 ***
Fertiliseryes 746.9 538.5 1.387 0.173
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1727 on 40 degrees of freedom
Multiple R-squared: 0.04588, Adjusted R-squared: 0.02203
F-statistic: 1.924 on 1 and 40 DF, p-value: 0.1731
# Fit the multiple linear regression model (one response, three predictors)
model_multi <- lm(Yield ~ Fertiliser + System + Abundance, data = crop)
# Display model summary
summary(model_multi)
Call:
lm(formula = Yield ~ Fertiliser + System + Abundance, data = crop)
Residuals:
Min 1Q Median 3Q Max
-2589.20 -544.59 -26.56 496.37 2093.17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7622.47 308.14 24.737 < 2e-16 ***
Fertiliseryes 820.67 327.33 2.507 0.01656 *
Systemmonoculture -2448.31 675.11 -3.627 0.00084 ***
Abundance -40.92 79.21 -0.517 0.60848
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1026 on 38 degrees of freedom
Multiple R-squared: 0.6799, Adjusted R-squared: 0.6547
F-statistic: 26.91 on 3 and 38 DF, p-value: 1.663e-09
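As a worked reading of the coefficients in the output above, the sketch below turns them into predicted yields. The coefficient values are copied from the output. The helper `predict_yield()` and the two example farms are hypothetical, introduced only for illustration.

```r
# Coefficients copied from the summary(model_multi) output above
b <- c(intercept = 7622.47, fert_yes = 820.67,
       sys_mono = -2448.31, abundance = -40.92)

# Hypothetical helper: predicted yield (kg/ha) for one farm.
# fertilised and monoculture are 0/1 indicators; the baseline (both 0)
# is an unfertilised diversified farm
predict_yield <- function(fertilised, monoculture, abundance) {
  unname(b["intercept"] +
         b["fert_yes"]  * fertilised +
         b["sys_mono"]  * monoculture +
         b["abundance"] * abundance)
}

# Fertilised monoculture farm with an abundance of 10
print(predict_yield(fertilised = 1, monoculture = 1, abundance = 10))

# Unfertilised diversified farm with an abundance of 2
print(predict_yield(fertilised = 0, monoculture = 0, abundance = 2))
```

The two predictions are 5585.63 and 7540.63 kg/ha. Most of the difference comes from the farming system coefficient, while the abundance term contributes little, consistent with its non-significant p-value (0.608).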
# Load necessary library
library(ggplot2)
# Create histogram for Yield
ggplot(crop, aes(x = Yield)) +
  geom_histogram(binwidth = 500, fill = "lightblue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Yield",
       x = "Yield (kg/ha)",
       y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
# Load necessary library
library(ggplot2)
# Create side-by-side histograms of Yield by farming system
ggplot(crop, aes(x = Yield, fill = System)) +
  geom_histogram(binwidth = 500, color = "black", alpha = 0.7) +
  theme_minimal() +
  facet_wrap(~ System) + # Creates separate histograms for each system
  labs(title = "Distribution of Yield for Diversified and Monoculture Farms",
       x = "Yield (kg/ha)",
       y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
Do not delete this section. We need this information for reproducibility and integrity checks.
=== Integrity Check Report ===
Time of execution: 2025-03-23 15:21:18
Last modified: 2025-03-23 15:21:17
File creation: 2025-03-23 15:21:17
Data hash: a1d8ce82f69bfc1110018ed0def8f9cf03c82d30594c03f112a025c07d97c2a9
File hash: 065e1a9322fcd2b4c269e3f432efda41f556c891ea5edd795d70c74b40f8795d
=== Environment Information ===
Working directory: /Users/sophiatweed/Library/CloudStorage/OneDrive-TheUniversityofSydney(Students)/Year 2/ENVX2001/ENVX2001-project1-template
User: sophiatweed
Home directory: /Users/sophiatweed
Language: en_US.UTF-8
=== R Session Information ===
R version: R version 4.4.2 (2024-10-31)
RStudio version: Not running in RStudio
=== System Information ===
Operating system: Darwin
OS version: Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:06 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T8103
Machine type: arm64
Node name: Sophias-Air-2.modem
=== Loaded Packages ===
Package Version Attached
digest digest 0.6.37 Yes
lubridate lubridate 1.9.4 Yes
forcats forcats 1.0.0 Yes
stringr stringr 1.5.1 Yes
dplyr dplyr 1.1.4 Yes
purrr purrr 1.0.4 Yes
readr readr 2.1.5 Yes
tidyr tidyr 1.3.1 Yes
tibble tibble 3.2.1 Yes
ggplot2 ggplot2 3.5.1 Yes
tidyverse tidyverse 2.0.0 Yes