Project 1 Report

ENVX2001 Applied Statistical Methods

Published

March 28, 2025

cat(paste0("My SID is: ", SID)) # DO NOT EDIT THIS LINE

My SID is: 540741612

Introduction

Data summary

crop

# A tibble: 42 × 4
   System      Fertiliser Yield Abundance
   <chr>       <chr>      <dbl>     <int>
 1 diversified yes        7835.         3
 2 monoculture no         3874.         9
 3 monoculture yes        5890.        11
 4 diversified yes        7559.         2
 5 monoculture yes        4708.        10
 6 monoculture yes        6446.        10
 7 diversified yes        7966.         1
 8 monoculture no         4402.         9
 9 monoculture no         3683.        12
10 monoculture no         3882.         9
# ℹ 32 more rows

str(crop)

tibble [42 × 4] (S3: tbl_df/tbl/data.frame)
 $ System    : chr [1:42] "diversified" "monoculture" "monoculture" "diversified" ...
 $ Fertiliser: chr [1:42] "yes" "no" "yes" "yes" ...
 $ Yield     : num [1:42] 7835 3874 5890 7559 4708 ...
 $ Abundance : int [1:42] 3 9 11 2 10 10 1 9 12 9 ...

fertiliser <- as.factor(crop$Fertiliser)
culture <- as.factor(crop$System)

a. Identify the response and predictor variables in the dataset.

The response variables are Yield and Abundance. The predictor variables are System and Fertiliser.
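Because System and Fertiliser are stored as character columns, one option is to convert them to factors inside the data frame itself, so that later models and plots pick up the grouping directly. The sketch below is illustrative only: `crop_head` is a hypothetical name for a data frame rebuilt from the ten rows printed above (yields rounded as displayed), not the full 42-row dataset.

```r
# Hypothetical stand-in for `crop`: only the ten rows shown in the printed tibble
crop_head <- data.frame(
  System     = c("diversified", "monoculture", "monoculture", "diversified", "monoculture",
                 "monoculture", "diversified", "monoculture", "monoculture", "monoculture"),
  Fertiliser = c("yes", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no"),
  Yield      = c(7835, 3874, 5890, 7559, 4708, 6446, 7966, 4402, 3683, 3882),
  Abundance  = c(3L, 9L, 11L, 2L, 10L, 10L, 1L, 9L, 12L, 9L)
)

# Convert the two predictors to factors in place
crop_head$System     <- factor(crop_head$System)
crop_head$Fertiliser <- factor(crop_head$Fertiliser)

# Inspect the resulting structure and factor levels
str(crop_head)
levels(crop_head$System)
levels(crop_head$Fertiliser)
```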


library(ggplot2)

# Yield distribution by System
ggplot(crop, aes(x = Yield, fill = System)) +
  geom_histogram(bins = 15, alpha = 0.7, position = "identity") +
  labs(title = "Yield Distribution by System FIG. 1", x = "Yield", y = "Count") +
  theme_minimal()

# Abundance distribution by System
ggplot(crop, aes(x = Abundance, fill = System)) +
  geom_bar(position = "dodge") +
  labs(title = "Abundance Distribution by System FIG. 2", x = "Abundance", y = "Count") +
  theme_minimal()

# Yield vs Abundance scatter plot
ggplot(crop, aes(x = Abundance, y = Yield, color = System, shape = Fertiliser)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "Yield vs Abundance FIG. 3", x = "Abundance", y = "Yield") +
  theme_minimal() +
  scale_color_brewer(palette = "Set2")

# Yield vs Fertiliser
ggplot(crop, aes(x = Fertiliser, y = Yield, fill = Fertiliser)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Yield by Fertiliser Use FIG. 4", x = "Fertiliser", y = "Yield") +
  theme_minimal()

# Abundance vs Fertiliser
ggplot(crop, aes(x = Fertiliser, y = Abundance, fill = Fertiliser)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Abundance by Fertiliser Use FIG. 5", x = "Fertiliser", y = "Abundance") +
  theme_minimal()

# Analysis of skew
# Histogram for Yield
ggplot(crop, aes(x = Yield)) +
  geom_histogram(bins = 15, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Yield FIG. 6", x = "Yield", y = "Count") +
  theme_minimal()

# Histogram for Abundance
ggplot(crop, aes(x = Abundance)) +
  geom_histogram(bins = 15, fill = "tomato", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Abundance FIG. 7", x = "Abundance", y = "Count") +
  theme_minimal()

Discussion

Discuss the implications of the data structure and distribution for data analysis.

Points to consider:

What are the implications of the data structure and distribution for data analysis?

The first point the data suggest is that fertiliser did not noticeably affect abundance in either monoculture or diversified crops, as seen in FIG. 5. However, it does appear to affect yield: as seen in FIG. 4, crops with fertiliser have a markedly higher yield than crops without fertiliser.

Comparing the two systems, monoculture crops have a lower yield and a higher abundance overall, whereas diversified crops have a markedly greater yield and a markedly lower abundance than monoculture crops.
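The visual comparison can be backed up with group means. The following is a minimal sketch, assuming a hypothetical data frame `crop_head` rebuilt from the ten rows printed in the data summary (yields rounded as displayed); the same calls should work on the full `crop` tibble.

```r
# Hypothetical stand-in for `crop`: only the ten rows shown in the printed tibble
crop_head <- data.frame(
  System     = c("diversified", "monoculture", "monoculture", "diversified", "monoculture",
                 "monoculture", "diversified", "monoculture", "monoculture", "monoculture"),
  Fertiliser = c("yes", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no"),
  Yield      = c(7835, 3874, 5890, 7559, 4708, 6446, 7966, 4402, 3683, 3882),
  Abundance  = c(3L, 9L, 11L, 2L, 10L, 10L, 1L, 9L, 12L, 9L)
)

# Mean of both response variables within each farming system
by_system <- aggregate(cbind(Yield, Abundance) ~ System, data = crop_head, FUN = mean)

# Mean of both response variables with and without fertiliser
by_fertiliser <- aggregate(cbind(Yield, Abundance) ~ Fertiliser, data = crop_head, FUN = mean)

print(by_system)
print(by_fertiliser)
```

On these ten rows the mean abundance is 2 for diversified and 10 for monoculture plots, which matches the direction described above, but ten rows are not enough to confirm the pattern for the full dataset.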

What are the potential challenges in analysing this dataset?

The potential challenges in analysing this dataset include the variance introduced by fertiliser, which could act as an equalising factor by affecting one system more than the other. Furthermore, abundance may affect yield indirectly. Finally, one level of a predictor may be substantially underrepresented, e.g. Fertiliser (yes/no); if this occurs, the higher variance may skew the results.
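Whether a group is underrepresented can be checked with a cross-tabulation of the two predictors. This is a minimal sketch on a hypothetical data frame `crop_head`, rebuilt from the ten rows printed in the data summary; the counts for the full 42-row dataset may differ.

```r
# Hypothetical stand-in for `crop`: only the ten rows shown in the printed tibble
crop_head <- data.frame(
  System     = c("diversified", "monoculture", "monoculture", "diversified", "monoculture",
                 "monoculture", "diversified", "monoculture", "monoculture", "monoculture"),
  Fertiliser = c("yes", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no"),
  Yield      = c(7835, 3874, 5890, 7559, 4708, 6446, 7966, 4402, 3683, 3882),
  Abundance  = c(3L, 9L, 11L, 2L, 10L, 10L, 1L, 9L, 12L, 9L)
)

# Number of observations in each System x Fertiliser combination
counts <- table(System = crop_head$System, Fertiliser = crop_head$Fertiliser)
print(counts)

# Flag any combination with no observations at all
has_empty_cell <- any(counts == 0)
print(has_empty_cell)
```

In these ten rows there are no unfertilised diversified plots, so an interaction between System and Fertiliser could not be estimated from them alone. Running the same table on the full dataset would show whether this imbalance persists.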

How might you address these challenges?

I would use an ANOVA model and related methods, e.g. a model based on the Poisson distribution for the count data or the Wilcoxon rank sum test, to directly address the possible problems with the variances and means. The transformations I would need include a sqrt(y) transformation to stabilise variance. Also, because of the bimodal distribution of the data seen in FIG. 6 and FIG. 7, I would use a log(y + 1) transformation to compress values that are too high.
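The methods named above could be set up roughly as follows. This is a sketch under stated assumptions, not the final analysis: `crop_head` is a hypothetical data frame rebuilt from the ten rows printed in the data summary, the additive model formula is one possible choice, and which variable receives which transformation is an assumption made for illustration.

```r
# Hypothetical stand-in for `crop`: only the ten rows shown in the printed tibble
crop_head <- data.frame(
  System     = factor(c("diversified", "monoculture", "monoculture", "diversified", "monoculture",
                        "monoculture", "diversified", "monoculture", "monoculture", "monoculture")),
  Fertiliser = factor(c("yes", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no")),
  Yield      = c(7835, 3874, 5890, 7559, 4708, 6446, 7966, 4402, 3683, 3882),
  Abundance  = c(3L, 9L, 11L, 2L, 10L, 10L, 1L, 9L, 12L, 9L)
)

# Two-way ANOVA on yield (additive model; assumed formula)
yield_aov <- aov(Yield ~ System + Fertiliser, data = crop_head)
print(summary(yield_aov))

# Wilcoxon rank sum test on abundance between systems
# exact = FALSE uses the normal approximation, since the counts contain ties
abund_wilcox <- wilcox.test(Abundance ~ System, data = crop_head, exact = FALSE)
print(abund_wilcox)

# Poisson generalised linear model for the abundance counts
abund_glm <- glm(Abundance ~ System + Fertiliser, family = poisson, data = crop_head)
print(summary(abund_glm))

# Transformations (assumed targets): square root for counts, log(y + 1) for yield
crop_head$sqrt_abundance <- sqrt(crop_head$Abundance)
crop_head$log_yield      <- log(crop_head$Yield + 1)
```

Model assumptions would still need checking, for example with `plot(yield_aov)` for residual diagnostics, before interpreting any of these results.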

Acknowledgements & statement of originality

I received assistance from ChatGPT to write the code used to visualise this dataset and to fix a few issues with my code when running and saving.

b. If you used AI tools, create a list of the tools you used and provide a brief description of how you used them, including your prompts and questions.

I used ChatGPT; the questions I asked it are listed below.

This report was written in RStudio and Quarto. I started by reading the data paper by Jones et al. (2021) to understand the context of the data. Then I used ChatGPT to visualise the data, which allowed me to explore the values in more detail, and to help me fix bugs in my code.

To complete the report, the following resources and tools were used:

Jones et al. (2021), tutorials/lectures, ChatGPT. ChatGPT prompts after loading data: “Hi CHATGPT can you visualise this dataset. Include at least one histogram, one boxplot and one geometric plot.”, “can you change the yield vs abundance plot to differenciate fertiliser by colour and system by shape”, “can you compare fertiliser to yield and fertiliser to abundance in a 1 to 1 rationalised format”

References

Jones, S. K., Sánchez, A. C., Juventia, S. D., & Estrada-Carmona, N. (2021). A global database of diversified farming effects on biodiversity and yield. Scientific Data, 8(1). https://doi.org/10.1038/s41597-021-01000-y

OpenAI. (2023). ChatGPT (Mar 14 version) [Large language model]. https://chatgpt.com

Session information

Do not delete this section. We need this information for reproducibility and integrity checks.

=== Integrity Check Report ===

Time of execution: 2025-03-28 13:32:27

Last modified: 2025-03-28 13:32:22

File creation: 2025-03-28 13:32:22

Data hash: a70c4842bf2394ce5cd665dcd388b5b52ae0e9de7bb8bad68d9c7324e0e2fabd

File hash: 5b390a83fff2720db34297a77aa89b262f228fb616e257e11df985c32dd0ef1a

=== Environment Information ===

Working directory: C:/Users/liamw/OneDrive/Desktop/ENVX2001/ENVX2001-project1-template

User:

Home directory: C:\Users\liamw\OneDrive\Documents

Language:

=== R Session Information ===

R version: R version 4.3.2 (2023-10-31 ucrt)

RStudio version: Not running in RStudio


=== System Information ===

Operating system: Windows

OS version: build 26100

Machine type: x86-64

Node name: LAPTOP-SI6KI3AM

=== Loaded Packages ===

            Package Version Attached
digest       digest  0.6.35      Yes
lubridate lubridate   1.9.3      Yes
forcats     forcats   1.0.0      Yes
stringr     stringr   1.5.1      Yes
dplyr         dplyr   1.1.4      Yes
purrr         purrr   1.0.2      Yes
readr         readr   2.1.5      Yes
tidyr         tidyr   1.3.1      Yes
tibble       tibble   3.2.1      Yes
ggplot2     ggplot2   3.5.1      Yes
tidyverse tidyverse   2.0.0      Yes