ES3307 Assignment 1

# Question 1A
library(readxl)
squirrels <- read_excel("Assignment_1_data.xls", sheet = "Squirrels")

par(mfrow = c(1,4))

# female histogram
hist(squirrels$FEMALE,
     main = "Female squirrel weights",
     xlab = "Weight",
     col = "lightblue")

# male histogram
hist(squirrels$MALE,
     main = "Male squirrel weights",
     xlab = "Weight",
     col = "lightgreen")

# female boxplot
boxplot(squirrels$FEMALE,
        col = "lightblue",
        main = "Female weights",
        ylab = "Weight")

# male boxplot
boxplot(squirrels$MALE,
        col = "lightgreen",
        main = "Male weights",
        ylab = "Weight")

# Question 1B

# two-sample t-test
t_test <- t.test(squirrels$FEMALE, squirrels$MALE)

# mann-whitney test
mw_test <- wilcox.test(squirrels$FEMALE, squirrels$MALE)

# show results
t_test

## 
##  Welch Two Sample t-test
## 
## data:  squirrels$FEMALE and squirrels$MALE
## t = -2.1662, df = 85.212, p-value = 0.03309
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.139618083 -0.005981917
## sample estimates:
## mean of x mean of y 
##    0.5180    0.5908

mw_test

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  squirrels$FEMALE and squirrels$MALE
## W = 1012, p-value = 0.1014
## alternative hypothesis: true location shift is not equal to 0

1C. The histograms show that neither female nor male squirrel weights are perfectly normally distributed, and female weights are slightly more right-skewed. The boxplots also show outliers and a difference in spread between the groups. Because the data do not satisfy the normality assumption of the t-test, and the group variances appear unequal, a non-parametric test is more suitable.
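The visual impressions above can be supplemented with formal checks. This is a minimal sketch on simulated stand-in data: the vectors `female` and `male` are hypothetical and are not the assignment data. `shapiro.test()` tests normality within each group and `var.test()` compares the two variances.

```r
# Hypothetical stand-in data; swap in squirrels$FEMALE and squirrels$MALE
set.seed(1)
female <- rlnorm(45, meanlog = -0.7, sdlog = 0.3)  # right-skewed sample
male   <- rnorm(45, mean = 0.59, sd = 0.15)        # roughly normal sample

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw_female <- shapiro.test(female)
sw_male   <- shapiro.test(male)

# F test for equality of two variances (itself sensitive to non-normality)
var_check <- var.test(female, male)

c(female   = sw_female$p.value,
  male     = sw_male$p.value,
  variance = var_check$p.value)
```

These tests are only a complement to the plots: with small samples they have little power, and with large samples they flag trivial departures.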

1D. Welch two-sample t-test: t(85.21) = -2.17, p = 0.033.
Since p < 0.05, the mean weights of female and male squirrels differ significantly. The negative t-value, together with the sample means (female 0.518, male 0.591), indicates that male squirrels have the higher mean weight.
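To show where the t-value and the fractional degrees of freedom come from, the Welch statistic can be reproduced by hand. This is a sketch on simulated data: `x` and `y` are hypothetical samples, not the squirrel weights, and the result is compared against `t.test()`.

```r
# Hypothetical samples standing in for the two groups
set.seed(2)
x <- rnorm(40, mean = 0.52, sd = 0.14)
y <- rnorm(50, mean = 0.59, sd = 0.18)

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Compare with R's built-in Welch test (the default for t.test)
ref <- t.test(x, y)
all.equal(unname(ref$statistic), t_stat)
```

The sign of `t_stat` follows the order of the arguments: it is negative whenever the first group has the smaller mean.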

1E. Mann–Whitney test: W = 1012, p = 0.10. Since p > 0.05, there is no statistically significant difference in the weight distributions between female and male squirrels.
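The W statistic reported by `wilcox.test()` is based on ranks only. The sketch below, on hypothetical simulated vectors `x` and `y` (not the assignment data), computes it in two equivalent ways and checks the result against R.

```r
# Hypothetical samples; rounding creates the kind of ties seen in real weights
set.seed(3)
x <- round(rnorm(12, mean = 0.52, sd = 0.15), 2)
y <- round(rnorm(15, mean = 0.59, sd = 0.15), 2)

# Rank the pooled data (ties receive average ranks)
ranks <- rank(c(x, y))
n_x   <- length(x)

# W = rank sum of the first sample minus its smallest possible value
w_stat <- sum(ranks[seq_len(n_x)]) - n_x * (n_x + 1) / 2

# Equivalent view: number of (x, y) pairs with x > y, counting ties as half
w_pairs <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# Compare with the built-in test
ref <- wilcox.test(x, y, exact = FALSE)
c(by_ranks = w_stat, by_pairs = w_pairs, wilcox = unname(ref$statistic))
```

Because only the ordering of the values matters, outliers and skew affect W far less than they affect the t statistic.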

1F. Because the normality assumption is not met, the p-value from the t-test may be unreliable, so using it here risks a Type I error (finding a significant difference in squirrel weights when there may be none).

# Question 2

melons <- read_excel("Assignment_1_data.xls", sheet = "Melons")

2A. Null hypothesis: The mean yield is the same for all four melon varieties.

Alternative hypothesis: At least one variety has a different mean yield compared to the others.

# Question 2B

library(ggplot2)

# convert VARIETY to a factor so ggplot treats it as categories to plot 4 separate boxplots
melons$VARIETY <- as.factor(melons$VARIETY)

ggplot(melons, aes(x = VARIETY, y = YIELDM, fill = VARIETY)) +
  geom_boxplot() +
  geom_point(alpha = 0.5) +
  labs(title = "Melon yields by variety",
       x = "Variety",
       y = "Yield (kg per plot)") +
  theme_minimal()

# Question 2C

library(tidyverse)

melons %>%
  group_by(VARIETY) %>%       # group data into 4 varieties
  summarise(                  
    mean = mean(YIELDM),     
    lower_CI = t.test(YIELDM)$conf.int[1], 
    upper_CI = t.test(YIELDM)$conf.int[2]
  )

## # A tibble: 4 × 4
##   VARIETY  mean lower_CI upper_CI
##   <fct>   <dbl>    <dbl>    <dbl>
## 1 1        20.5     15.6     25.4
## 2 2        37.4     33.3     41.5
## 3 3        20.5     12.9     28.0
## 4 4        29.9     27.6     32.2
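The confidence limits returned by `t.test()` for a single group follow the usual formula, mean ± t × SE. This is a minimal sketch with a hypothetical `yield` vector (not the melon data) that reproduces them by hand.

```r
# Hypothetical yields for one variety
set.seed(4)
yield <- rnorm(6, mean = 30, sd = 4)

n      <- length(yield)
se     <- sd(yield) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # two-sided 95% critical value

# 95% confidence interval: mean plus/minus critical value times SE
ci_manual <- mean(yield) + c(-1, 1) * t_crit * se

# Same interval from t.test()
ci_ref <- as.numeric(t.test(yield)$conf.int)
rbind(manual = ci_manual, t_test = ci_ref)
```

Each interval computed this way uses only that group's own standard deviation, which is why groups with more variable yields get wider intervals.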

# Question 2D & 2E

melon_aov <- aov(YIELDM ~ VARIETY, data = melons)  # avoid masking stats::anova()
summary(melon_aov)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## VARIETY      3 1115.3   371.8    23.8 1.73e-06 ***
## Residuals   18  281.2    15.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow = c(2,2))  # show all four diagnostic plots in one panel
plot(melon_aov)

2E. Residuals vs Fitted: the residuals show no distinct pattern around zero, so the model structure appears adequate. Q-Q Residuals: the residuals are approximately normally distributed, with only small deviations at the extreme ends. Scale-Location: there is slight heteroscedasticity, with somewhat greater spread at low and high fitted values than in the middle (unequal variance). Residuals vs Leverage: there are no influential outliers, as no points lie beyond the Cook's distance lines.

2F. One-way ANOVA showed a highly significant effect of variety on yield, F(3,18) = 23.8, p < 0.001. We therefore reject the null hypothesis: at least one melon variety has a different mean yield from the others.
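The reported F value and p-value can be checked directly from the sums of squares and degrees of freedom in the ANOVA table above. The sketch uses only those printed (rounded) values, so the results match the table up to rounding.

```r
# Values copied from the ANOVA table
ss_variety <- 1115.3; df_variety <- 3
ss_resid   <- 281.2;  df_resid   <- 18

# Mean squares: sum of squares divided by degrees of freedom
ms_variety <- ss_variety / df_variety
ms_resid   <- ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares
f_value <- ms_variety / ms_resid   # about 23.8

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_variety, df_resid, lower.tail = FALSE)

c(F = f_value, p = p_value)
```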

# Question 3A

trees <- read_excel("Assignment_1_data.xls", sheet = "Dioecious trees")

library(ggplot2)

trees$SEX <- factor(trees$SEX, labels = c("Male", "Female"))

ggplot(trees, aes(x = SEX, y = FLOWERS, fill = SEX)) +
  geom_boxplot(alpha = 0.6) +
  geom_point(alpha = 0.6) +
  labs(title = "Number of flowers by sex",
       x = "Sex",
       y = "Number of flowers") +
  scale_fill_manual(values = c("Male" = "skyblue", "Female" = "pink")) +
  theme_minimal()

# Question 3B

# Testing for distribution

par(mfrow = c(1,2))

# Male trees only
qqnorm(trees$FLOWERS[trees$SEX == "Male"],
       main = "Normal Q-Q Plot: Male Trees")
qqline(trees$FLOWERS[trees$SEX == "Male"], col = "red")

# Female trees only
qqnorm(trees$FLOWERS[trees$SEX == "Female"],
       main = "Normal Q-Q Plot: Female Trees")
qqline(trees$FLOWERS[trees$SEX == "Female"], col = "red")

Since the data deviate from a normal distribution, the response variable (number of flowers) is a count, and the explanatory variable (tree sex) is categorical with two levels, we will use a non-parametric test (Mann–Whitney U test).

# Mann-Whitney U test (exact = FALSE uses the normal approximation)

mw_test <- wilcox.test(FLOWERS ~ SEX, data = trees, exact = FALSE)

mw_test

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FLOWERS by SEX
## W = 298, p-value = 0.9763
## alternative hypothesis: true location shift is not equal to 0

3C. The Mann–Whitney test showed no significant difference in flower numbers between male and female trees (W = 298, p = 0.976). There is therefore no evidence in this sample that the two sexes produce different numbers of flowers.

2025-09-04