data <- "Assignment 1 data.xls"
readxl::excel_sheets(data)

## [1] "Squirrels"       "Melons"          "Dioecious trees"

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)

## Warning: package 'readxl' was built under R version 4.3.3

library(janitor)

## Warning: package 'janitor' was built under R version 4.3.3

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

1 Question 1: Squirrels — female vs male body mass

1A. Plot the data using histogram(s) and boxplot(s).

#Read the "Squirrels" sheet from the Excel 
squirrels <- readxl::read_excel(data, sheet = "Squirrels") |>

janitor::clean_names() %>%          
  rename_with(toupper) %>%            
  pivot_longer(                     
    cols = c(FEMALE, MALE),
    names_to = "SEX",                
    values_to = "WEIGHT") %>%
  mutate(SEX = factor(SEX, levels = c("FEMALE","MALE")))

#Plot histograms by sex (Same as mentioned on our calss-- the clean style)
ggplot(squirrels %>% filter(!is.na(WEIGHT)), aes(x = WEIGHT)) +
  geom_histogram(
    bins = 20,                        # bin count 
    fill = "skyblue", color = "black" # simple colour scheme for clarity in reports
  ) +
  facet_wrap(
    ~SEX, nrow = 1,                   # draw separate panels for FEMALE and MALE
    scales = "free"                   # let each panel have its own x & y ranges
  ) +
  labs(
    title = "Squirrel Weights by Sex – Histogram",
    x = "Body weight", y = "Count"
  ) +
  theme_classic() +                   # white background; no grey grid
  theme(
    strip.background = element_blank(), # remove facet strip boxes around FEMALE/MALE
    strip.text = element_text(face = "bold"),
    axis.title = element_text(size = 12),
    axis.text  = element_text(size = 10)
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0), add = c(0, 1)) # always leave headroom above tallest bar
  ) +
  scale_x_continuous(
    breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2),
    # ^ explicit tick marks so MALE can show larger values ,
    expand = expansion(mult = c(0, 0), add = c(0.02, 0.05)) # add small margins at both ends
  )

#Boxplot by sex 
ggplot(squirrels %>% filter(!is.na(WEIGHT)),
       aes(x = SEX, y = WEIGHT, fill = SEX)) +
  geom_boxplot(width = 0.5, alpha = 0.85, color = "black") +  # tidy, high-contrast boxplot
  labs(title = "Squirrel Weights by Sex – Boxplot",
       x = "Sex", y = "Body weight") +
  theme_classic() +
  theme(legend.position = "none")      # remove redundant legend (SEX is on the x-axis)

The histogram shows that both female and male squirrels exhibit right-skewed body weight distributions, with males displaying a broader spread and a longer tail towards higher values. Female weights mostly range between 0.3 and 0.8, while male weights extend beyond 1.1. This suggests that some male squirrels may attain much higher body mass than females.

The boxplot supports this observation. Although the median weight is only slightly higher in males, the male group shows greater variability. Notably, one male squirrel was flagged as an outlier, appearing as a black dot above the upper whisker. This indicates that its weight lies more than 1.5 times the interquartile range (IQR) above the third quartile, and may influence the results of parametric tests.

1B. Perform a two-sample t-test and a Mann-Whitney test.

#Summarise sample size, mean, and SD for each group
squirrels %>%
  group_by(SEX) %>%
  summarise(
    n    = sum(!is.na(WEIGHT)),             
    mean = mean(WEIGHT, na.rm = TRUE),      
    sd   = sd(WEIGHT, na.rm = TRUE)        
  )

## # A tibble: 2 × 4
##   SEX        n  mean    sd
##   <fct>  <int> <dbl> <dbl>
## 1 FEMALE    50 0.518 0.132
## 2 MALE      50 0.591 0.198

#t-tests for comparison 
t.test(WEIGHT ~ SEX, data = squirrels, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  WEIGHT by SEX
## t = -2.1662, df = 85.212, p-value = 0.03309
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  -0.139618083 -0.005981917
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##               0.5180               0.5908

#Mann-Whitney test  
wilcox.test(WEIGHT ~ SEX, data = squirrels, exact = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  WEIGHT by SEX
## W = 1012, p-value = 0.1014
## alternative hypothesis: true location shift is not equal to 0

We performed both a two-sample t-test and a Mann–Whitney U test to compare the body weight of male and female squirrels. The t-test yielded a p-value below 0.05, suggesting a statistically significant difference in means. The Mann–Whitney test, however, produced a higher p-value. These results will be further interpreted in Questions 1C to 1F, considering the data distribution and potential violations of test assumptions.

1C. What features of the plots suggest that the use of a parametric test may be unwise?

The histogram shows that the body weight distributions for both sexes are right-skewed, especially for males, and the boxplot reveals the presence of at least one clear outlier in the male group. Additionally, the male group exhibits greater variance than the female group. These features — skewness, unequal variances, and outliers — violate the assumptions of normality and homogeneity of variance required for parametric tests such as the t-test. Therefore, the use of a non-parametric test may be more appropriate.

1D. From the results of the (parametric) t-test, report the t, dfs, and p-values, and interpret them.

A two-sample Welch t-test was conducted to compare the mean body weight between female and male squirrels. The result was t = -2.1662, df = 85.212, and p = 0.03309. Since the p-value is below the conventional alpha level of 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in mean body weight between sexes. On average, male squirrels are heavier than females.

1E. From the results of the (non-parametric) Mann-Whitney test, report the W, and p-values, and interpret them.

A Mann–Whitney U test (Wilcoxon rank-sum test) was performed as a non-parametric alternative to compare body weight distributions. The result was W = 1012, with a p-value = 0.1014. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This indicates that there is no significant difference in body weight distributions between male and female squirrels based on ranks.

1F. What type of error do you get if you use a t-test?

If a t-test is used despite violations of its assumptions (e.g., non-normality, outliers, or unequal variances), it may produce inflated significance. This increases the risk of making a Type I error — falsely rejecting a true null hypothesis and concluding that a difference exists when it does not. In this case, the t-test yielded a significant result (p = 0.033), but the non-parametric test did not (p = 0.101), highlighting the potential for such a Type I error.

2 Question 2: Melons — yield by variety

2A. What is/are the hypothesis/es?

Null hypothesis:There is no difference in the mean yield (YIELDM) among the four melon varieties.

Alternative hypothesis:At least one variety has a different mean yield compared to the others.

2B. Plot the data so that you can see the difference in yield among the varieties.

# Read and tidy the Melons dataset
melons <- readxl::read_excel(data, sheet = "Melons") %>%
  janitor::clean_names() %>%
  rename_with(toupper) %>%
  mutate(VARIETY = factor(VARIETY)) 



# Plot: Yield by Variety
ggplot(melons %>% filter(!is.na(YIELDM)), aes(x = VARIETY, y = YIELDM, fill = VARIETY)) +
  geom_boxplot(width = 0.5, alpha = 0.8, outlier.shape = NA) +       
  labs(title = "Melon Yield by Variety",
       x = "Variety", y = "Yield (kg per plot)") +
  theme_classic() +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

The boxplot shows clear differences in yield across the four melon varieties. Variety 2 has the highest median yield. Variety 1 and 3 have lower yields with greater spread, and Variety 3 has a smaller sample size, consistent with experimental losses. Variety 4 displays a relatively tight distribution, suggesting stable yield performance across plots.

2C. Produce descriptive statistics that show you the means of the groups, and the confidence intervals.

# Mean + 95% CI plot
melon_summary <- melons %>%
  group_by(VARIETY) %>%
  summarise(
    n    = sum(!is.na(YIELDM)),
    mean = mean(YIELDM, na.rm = TRUE),
    sd   = sd(YIELDM, na.rm = TRUE),
    se   = sd / sqrt(n),
    ci95 = qt(0.975, df = n - 1) * se,
    lower = mean - ci95,
    upper = mean + ci95
  )

# Then plot manually using that table
ggplot(melon_summary, aes(x = VARIETY, y = mean)) +
  geom_point(aes(fill = VARIETY), shape = 21, size = 4, color = "black") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(title = "Mean Yield with 95% Confidence Intervals",
       x = "Variety", y = "Mean yield (kg per plot)") +
  theme_classic() +
  theme(
    legend.position = "none",
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

The plot shows the mean yield for each melon variety with 95% confidence intervals. Variety 2 has the highest mean yield, followed by Variety 4. Variety 1 and 3 show lower yields. Variety 3 has the widest CI, reflecting uncertainty from its smaller sample size (n = 4). In contrast, Variety 4 has the narrowest CI, indicating more consistent performance across plots.

2D. Produce an analysis of variance of YIELDM with respect to VARIETY.

# Fit linear model
model <- lm(YIELDM ~ VARIETY, data = melons)

anova(model)

## Analysis of Variance Table
## 
## Response: YIELDM
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## VARIETY    3 1115.28  371.76  23.798 1.735e-06 ***
## Residuals 18  281.19   15.62                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result was significant: F(3, 18) = 23.80, p < 0.001. This indicates that mean melon yield differs significantly between at least two varieties.

2E. Show and describe the model’s resulting diagnostic plots.

#Diagnostic plots 
par(mfrow = c(2, 2))
plot(model)

Residuals vs Fitted: With four vertical bands (one per variety) expected for a categorical predictor, the residuals show no clear trend; variance is roughly constant, with a slight increase at the extremes. Normal Q–Q: Points lie close to the diagonal with a mild deviation in the upper tail, indicating approximately normal residuals with a slightly heavy right tail. Scale–Location: The spread of √|standardized residuals| is fairly even across fitted values, showing only a small hint of heteroscedasticity. Residuals vs Leverage: No observations exceed Cook’s distance; there are no unduly influential points. Overall, departures are minor and the ANOVA results can be considered reliable.

2F. Report the F, dfs, and p-values, and interpret them.

A one-way ANOVA was conducted to evaluate whether melon yield differed across the four varieties. The analysis yielded a significant result: F(3, 18) = 23.798, p < 0.001. This provides strong evidence that at least one melon variety has a different mean yield compared to the others. The small p-value (< 0.001) allows us to reject the null hypothesis that all group means are equal.

3 Question 3: Dioecious trees — flowers by sex

3A. Show graphically how the number of flowers differs between the sexes.

#Read in the "Dioecious trees" sheet 
if (!exists("data")) data <- "Assignment 1 data.xls"
trees <- readxl::read_excel(data, sheet = "Dioecious trees") |>

  janitor::clean_names() %>%                    # convert column names to snake_case
  rename_with(toupper) %>%                      # convert all column names to UPPERCASE
  mutate(
    SEX = factor(SEX, levels = c(1, 2),         # convert SEX from numeric to factor
                 labels = c("MALE", "FEMALE"))
  )

#Boxplot + jitter plot to compare flower numbers by sex
ggplot(trees, aes(x = SEX, y = FLOWERS, fill = SEX)) +
  geom_boxplot(width = 0.5, alpha = 0.8, color = "black") +
  labs(
    title = "Number of Flowers by Tree Sex",
    x = "Sex", y = "Number of Flowers"
  ) +
  theme_classic() +
  theme(legend.position = "none")

The boxplot shows that male trees generally have a higher median number of flowers than female trees. The interquartile range for female trees is noticeably larger than that for males, indicating greater variability in the number of flowers. In contrast, the male group shows a more consistent distribution with most values concentrated within a narrower range., while the female group displays more variability, including two high-value outliers.

3B. Test the hypothesis that male and female trees produce different number of flowers.

#Mann–Whitney U test
wilcox.test(FLOWERS ~ SEX, data = trees, exact = FALSE)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FLOWERS by SEX
## W = 298, p-value = 0.9763
## alternative hypothesis: true location shift is not equal to 0

3C. Report the appropriate test statistic, dfs, and p-values, and interpret them.

A Mann–Whitney U test was conducted to compare the number of flowers produced by male and female trees. The test yielded W = 298 and p = 0.9763. Since the p-value is much greater than 0.05, we fail to reject the null hypothesis. This suggests that there is no statistically significant difference in flower production between male and female trees in this sample.

ES3307/ES7028 — Assignment 1 (Code)

Aorui Liu

1 Question 1: Squirrels — female vs male body mass

2 Question 2: Melons — yield by variety

3 Question 3: Dioecious trees — flowers by sex