# Question 1A
library(readxl)
squirrels <- read_excel("Assignment_1_data.xls", sheet = "Squirrels")
par(mfrow = c(1,4))
# female histogram
hist(squirrels$FEMALE,
main = "Female squirrel weights",
xlab = "Weight",
col = "lightblue")
# male histogram
hist(squirrels$MALE,
main = "Male squirrel weights",
xlab = "Weight",
col = "lightgreen")
# female boxplot
boxplot(squirrels$FEMALE,
col = "lightblue",
main = "Female weights",
ylab = "Weight")
# male boxplot
boxplot(squirrels$MALE,
col = "lightgreen",
main = "Male weights",
ylab = "Weight")
# Question 1B
# two-sample t-test
t_test <- t.test(squirrels$FEMALE, squirrels$MALE)
# mann-whitney test
mw_test <- wilcox.test(squirrels$FEMALE, squirrels$MALE)
# show results
t_test
##
## Welch Two Sample t-test
##
## data: squirrels$FEMALE and squirrels$MALE
## t = -2.1662, df = 85.212, p-value = 0.03309
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.139618083 -0.005981917
## sample estimates:
## mean of x mean of y
## 0.5180 0.5908
mw_test
##
## Wilcoxon rank sum test with continuity correction
##
## data: squirrels$FEMALE and squirrels$MALE
## W = 1012, p-value = 0.1014
## alternative hypothesis: true location shift is not equal to 0
1C. The histograms show that neither female nor male squirrel weights are perfectly normally distributed, with female weights slightly more right-skewed. The boxplots also show outliers and a difference in spread between the groups. Because the data appear to violate the normality assumption of the t-test (and the equal-variance assumption of its classic pooled form), a non-parametric test would be more suitable.
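The visual checks can be complemented by formal tests. This is a minimal sketch on simulated data: the vectors `female` and `male` are hypothetical stand-ins for `squirrels$FEMALE` and `squirrels$MALE`, and the sample sizes are assumed. It uses the Shapiro-Wilk test for normality and an F test for equal variances, both from base R.

```r
# Hypothetical stand-in data; replace with squirrels$FEMALE and squirrels$MALE
set.seed(1)
female <- rlnorm(50, meanlog = log(0.5), sdlog = 0.3)  # right-skewed
male   <- rnorm(50, mean = 0.6, sd = 0.15)             # roughly normal

# Shapiro-Wilk: a small p-value suggests a departure from normality
sw_female <- shapiro.test(female)
sw_male   <- shapiro.test(male)

# F test comparing the two variances (itself sensitive to non-normality)
f_var <- var.test(female, male)

assumption_p <- c(female    = sw_female$p.value,
                  male      = sw_male$p.value,
                  variances = f_var$p.value)
assumption_p
```

These tests should be read together with the plots, since with small samples they may have little power to detect a violation.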
1D. Two-sample (Welch) t-test: t(85.21) = -2.17, p = 0.033.
Since p < 0.05, the mean weights of female and male squirrels differ
significantly. The negative t statistic reflects that the female mean
(0.518) is lower than the male mean (0.591), i.e. male squirrels are
heavier on average.
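To show where the t statistic and the fractional degrees of freedom come from, the Welch test can be reproduced by hand. This is a sketch on simulated data; `x` and `y` are hypothetical stand-ins for the female and male weights, not the assignment data.

```r
# Hypothetical stand-in data for the two groups
set.seed(2)
x <- rnorm(40, mean = 0.52, sd = 0.12)  # "female" weights
y <- rnorm(50, mean = 0.59, sd = 0.18)  # "male" weights

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite degrees of freedom (usually not an integer)
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test
fit <- t.test(x, y)
all.equal(unname(fit$statistic), t_manual)
all.equal(unname(fit$parameter), df_manual)
all.equal(fit$p.value, p_manual)
```

The sign of t depends only on the order of the arguments: with the first group having the smaller mean, t is negative.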
1E. Mann–Whitney test: W = 1012, p = 0.10. Since p > 0.05, there is no statistically significant difference in the weight distributions of female and male squirrels.
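The W statistic reported by `wilcox.test()` can be reconstructed from the ranks of the pooled data. This sketch uses simulated data (`x` and `y` are hypothetical): W is the rank sum of the first group minus its minimum possible value, which equals the number of (x, y) pairs in which x exceeds y, with ties counted as one half.

```r
# Hypothetical stand-in data
set.seed(3)
x <- round(rnorm(12, mean = 0.52, sd = 0.1), 3)
y <- round(rnorm(15, mean = 0.59, sd = 0.1), 3)

# Rank the pooled sample (ties receive midranks)
r  <- rank(c(x, y))
n1 <- length(x)

# W = rank sum of the first group minus n1 * (n1 + 1) / 2
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Equivalent pair-counting form
W_pairs <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# Built-in test for comparison
W_r <- unname(wilcox.test(x, y, exact = FALSE)$statistic)
c(manual = W_manual, pairs = W_pairs, wilcox = W_r)
```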
1F. Because the normality assumption is doubtful, the p-value from the t-test may be unreliable, so using a t-test here risks a Type I error (declaring a significant difference in squirrel weights when there is none).
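One way to examine this risk is a small simulation in which the null hypothesis is true by construction. The sketch below is illustrative only: the lognormal distribution, the sample size of 40 per group and the number of repetitions are all assumptions, not properties of the squirrel data. It estimates how often each test rejects at the 5% level.

```r
set.seed(4)
n_sim <- 2000  # number of simulated experiments (assumed)

p_vals <- replicate(n_sim, {
  # Both groups come from the SAME right-skewed distribution,
  # so every rejection is a false positive
  a <- rlnorm(40, meanlog = -0.6, sdlog = 0.4)
  b <- rlnorm(40, meanlog = -0.6, sdlog = 0.4)
  c(t  = t.test(a, b)$p.value,
    mw = wilcox.test(a, b, exact = FALSE)$p.value)
})

# Proportion of false positives at alpha = 0.05 for each test
false_pos <- rowMeans(p_vals < 0.05)
false_pos
```

A rate well above 0.05 for either test under the chosen settings would indicate an inflated Type I error rate for that test.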
# Question 2
melons <- read_excel("Assignment_1_data.xls", sheet = "Melons")
2A. Null hypothesis: The mean yield is the same for all four melon varieties.
Alternative hypothesis: At least one variety has a different mean yield compared to the others.
# Question 2B
library(ggplot2)
# convert VARIETY to a factor so ggplot treats it as categories to plot 4 separate boxplots
melons$VARIETY <- as.factor(melons$VARIETY)
ggplot(melons, aes(x = VARIETY, y = YIELDM, fill = VARIETY)) +
geom_boxplot() +
geom_point(alpha = 0.5) +
labs(title = "Melon yields by variety",
x = "Variety",
y = "Yield (kg per plot)") +
theme_minimal()
# Question 2C
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
melons %>%
group_by(VARIETY) %>% # group data into 4 varieties
summarise(
mean = mean(YIELDM),
lower_CI = t.test(YIELDM)$conf.int[1],
upper_CI = t.test(YIELDM)$conf.int[2]
)
## # A tibble: 4 × 4
## VARIETY mean lower_CI upper_CI
## <fct> <dbl> <dbl> <dbl>
## 1 1 20.5 15.6 25.4
## 2 2 37.4 33.3 41.5
## 3 3 20.5 12.9 28.0
## 4 4 29.9 27.6 32.2
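The confidence limits extracted from `t.test()` are the usual t-based interval, mean ± t(0.975, n − 1) × SE. This sketch checks that by hand for one group; `yield` is a hypothetical vector of yields for a single variety, not the melon data.

```r
# Hypothetical yields for a single variety
set.seed(5)
yield <- rnorm(6, mean = 30, sd = 4)

n  <- length(yield)
se <- sd(yield) / sqrt(n)  # standard error of the mean

# 95% CI: mean plus/minus the t quantile times the standard error
ci_manual <- mean(yield) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Same interval taken from t.test()
ci_ttest <- as.numeric(t.test(yield)$conf.int)
all.equal(ci_manual, ci_ttest)
```

Groups with fewer observations or more variable yields get wider intervals, which is consistent with the differing interval widths in the table above.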
# Question 2D & 2E
anova <- aov(YIELDM ~ VARIETY, data = melons)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## VARIETY 3 1115.3 371.8 23.8 1.73e-06 ***
## Residuals 18 281.2 15.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2,2)) # 2 x 2 grid so all four diagnostic plots appear in one figure
plot(anova)
2E.
Residuals vs Fitted: the residuals show no distinct pattern, so the linearity assumption appears reasonable.
Q-Q Residuals: the residuals are approximately normally distributed, with only small deviations at the extreme ends.
Scale-Location: slight heteroscedasticity; the spread is slightly greater at low and high fitted values than in the middle (unequal variance).
Residuals vs Leverage: no influential outliers, as no points lie beyond the Cook’s distance lines.
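The diagnostic plots can be backed up with formal tests on the fitted model. This is a minimal sketch on simulated data: the data frame `sim`, its group sizes and its group means are hypothetical, chosen only to resemble a four-variety design with 22 plots. It applies the Shapiro-Wilk test to the residuals and Bartlett's test for equal variances across groups.

```r
# Hypothetical four-group data set (22 plots in total)
set.seed(6)
sizes <- c(6, 6, 5, 5)
sim <- data.frame(
  variety = factor(rep(1:4, times = sizes)),
  yield   = rnorm(22, mean = rep(c(20, 37, 20, 30), times = sizes), sd = 4)
)

fit <- aov(yield ~ variety, data = sim)

# Normality of residuals: a small p-value suggests non-normality
sw <- shapiro.test(residuals(fit))

# Homogeneity of variance across varieties: a small p-value suggests unequal variances
bt <- bartlett.test(yield ~ variety, data = sim)

c(shapiro = sw$p.value, bartlett = bt$p.value)
```

Bartlett's test is sensitive to non-normality, so its result is best interpreted alongside the Q-Q plot.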
2F. One-way ANOVA showed a highly significant effect of variety on yield, F(3,18) = 23.8, p < 0.001. We therefore reject the null hypothesis: at least one melon variety has a mean yield that differs from the others.
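The F value and p-value can be recovered directly from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses only the figures printed in that table; because they are rounded, the p-value will be close to, but not exactly, the reported 1.73e-06.

```r
# Values copied from the ANOVA table above
ss_variety <- 1115.3; df_variety <- 3
ss_resid   <- 281.2;  df_resid   <- 18

# Mean squares = sum of squares / degrees of freedom
ms_variety <- ss_variety / df_variety
ms_resid   <- ss_resid / df_resid

# F = between-group mean square / residual mean square
f_value <- ms_variety / ms_resid

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_variety, df_resid, lower.tail = FALSE)

round(f_value, 1)  # 23.8, matching the table
p_value
```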
# Question 3A
trees <- read_excel("Assignment_1_data.xls", sheet = "Dioecious trees")
library(ggplot2)
trees$SEX <- factor(trees$SEX, labels = c("Male", "Female"))
ggplot(trees, aes(x = SEX, y = FLOWERS, fill = SEX)) +
geom_boxplot(alpha = 0.6) +
geom_point(alpha = 0.6) +
labs(title = "Number of flowers by sex",
x = "Sex",
y = "Number of flowers") +
scale_fill_manual(values = c("Male" = "skyblue", "Female" = "pink")) +
theme_minimal()
# Question 3B
# Testing for distribution
par(mfrow = c(1,2))
# Male trees only
qqnorm(trees$FLOWERS[trees$SEX == "Male"],
main = "Normal Q-Q Plot: Male Trees")
qqline(trees$FLOWERS[trees$SEX == "Male"], col = "red")
# Female trees only
qqnorm(trees$FLOWERS[trees$SEX == "Female"],
main = "Normal Q-Q Plot: Female Trees")
qqline(trees$FLOWERS[trees$SEX == "Female"], col = "red")
Since the data deviate from a normal distribution, and the response variable (number of flowers) is a count while the explanatory variable (tree sex) is categorical with two levels, we will use a non-parametric test (Mann–Whitney U test).
# Question 3B
mw_test <- wilcox.test(FLOWERS ~ SEX, data = trees, exact = FALSE)
mw_test
##
## Wilcoxon rank sum test with continuity correction
##
## data: FLOWERS by SEX
## W = 298, p-value = 0.9763
## alternative hypothesis: true location shift is not equal to 0
3C.
The Mann–Whitney test showed no significant difference in flower numbers between male and female trees (W = 298, p = 0.976). There is therefore no evidence in this sample that the two sexes differ in the number of flowers they produce.