# loading required libraries
library(tidyverse)
library(ggplot2)
##Question 1: In a study of squirrels, the weight of 50 female and 50 male squirrels was measured. The observations are in the columns FEMALE and MALE (Squirrels dataset).
#loading squirrel data from excel
library(readxl)
Squirrel_data <- read_excel("C:/Users/LIEWY/Downloads/Assignment 1 data(1)(1).xls", sheet = "Squirrels")
View(Squirrel_data)
1A. Plot the data using histogram(s) and boxplot(s).
#Histogram of male squirrel weight distribution
Squirrel_data %>%
ggplot(aes(x = MALE))+
geom_histogram(binwidth = 0.05, colour = "black", fill = "lightblue")+
labs(x = 'Weight (kg)', y = 'Frequency (n)', title = "Distribution of male squirrel weights")+
theme_minimal()
#histogram of female squirrel weight distribution
Squirrel_data %>%
ggplot(aes(x = FEMALE))+
geom_histogram(binwidth = 0.05, colour = "black", fill = "lightpink")+
labs(x = 'Weight (kg)', y = 'Frequency (n)', title = "Distribution of female squirrel weights")+
theme_minimal()
#data frame currently stores weight data in columns; x variable is sex (categorical), y variable is weight
#x variables are currently just headings amd need to be 'paired' with their respective data points in a combined column
Squirrel_new <- Squirrel_data %>%
pivot_longer(cols = c(MALE, FEMALE), #converts rows (in this case, MALE and FEMALE, into columns
names_to = "Sex", #assigns MALE/FEMALE to their respective data points
values_to = "Weight") #existing weight values are labelled as 'weight'
# Plotting boxplot comparing the weights of both squirrel sexes
ggplot(Squirrel_new, aes(x = Sex, y = Weight, fill = Sex)) +
geom_boxplot() +
labs(title = "Comparison of Male and Female Squirrel Weights",
x = "Sex",
y = "Weight (kg)") +
theme_minimal()
1B. Perform a two-sample t-test and a Mann-Whitney test.
#Performing two-sample t-test
squirrel_t <-t.test(Weight~Sex, data = Squirrel_new, var.equal = TRUE)
print(squirrel_t)
##
## Two Sample t-test
##
## data: Weight by Sex
## t = -2.1662, df = 98, p-value = 0.03272
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
## -0.139492762 -0.006107238
## sample estimates:
## mean in group FEMALE mean in group MALE
## 0.5180 0.5908
#Performing Mann-Whitney test
squirrel_MW <-wilcox.test(Weight~Sex, data = Squirrel_new, exact = FALSE )
print(squirrel_MW)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Weight by Sex
## W = 1012, p-value = 0.1014
## alternative hypothesis: true location shift is not equal to 0
1C. What features of the plots suggest that the use of a parametric test may be unwise?
The histograms and boxplots plotted both show that the weights of either gender are positively skewed. This suggests that the 2 sets of weight data are not normally distributed for either gender, and hence a parametric test that assumes normality should not be used.
1D. From the results of the (parametric) t-test, report the t, dfs, and p-values, and interpret them.
Parametric test: Two Sample t-test The test statistic is the t-value. In this case, it compares the true difference of means of male and female squirrel weights to the hypothesized difference, which is 0. The larger the t-value (be it positive or negative), the larger the variability between the true difference and hypothesized difference, and the higher the probability that the observed variability is significant. The t-value of this test was -2.1662, which suggests that female squirrels weigh less than male squirrels. Since the p-value 0.03272 < 0.05, this observation is significant and the null hypothesis that states that this difference would be 0 should be rejected in favour of the alternative hypothesis.
The degrees of freedom, df = 98, is the number of values that were considered in estimating the variability of the difference in means from the hypothesized mean. In this case, df = 100 - 2, where 100 is the number of squirrels from the dataset and 2 is derived from the number of times the mean of each group was calculated, where each instance is a constraint.
1E. From the results of the (non-parametric) Mann-Whitney test, report the W, and p-values, and interpret them.
Non-parametric test: Mann-Whitney test The W-statistic is the test statistic. A large W value (in this case 1012) suggests that the difference between the groups is likely to be due to random chance. Since the p-value 0.1014 > 0.05, there is no significant difference between the mean weights of male and female squirrels. This supports the null hypothesis and thus it is not rejected.
1F. What type of error do you get if you use a t-test?
A type I error/false positive where the null hypothesis is wrongly rejected.
An experiment was performed to compare four melon varieties. It was designed so that each variety was grown in six plots — but two plots growing variety 3 were accidentally destroyed. The data (Melons dataset) are in the columns YIELDM and VARIETY.
#loading melon yield data from excel
Melons_data <- read_excel("C:/Users/LIEWY/Downloads/Assignment 1 data(1)(1).xls", sheet = "Melons")
View(Melons_data)
2A. What is/are the hypothesis/es?
H0: The mean yields of all four melon varieties are equal. Ha: The mean yields of the four melon varieties are not all equal.
2B. Plot the data so that you can see the difference in yield among the varieties.
#Data frame currently stores the 4 variety categories in a single column
#Creating subsets of each variety such that individual boxplots can be created
var1 <- subset(Melons_data, VARIETY == 1)
var2 <- subset(Melons_data, VARIETY == 2)
var3 <- subset(Melons_data, VARIETY == 3)
var4 <- subset(Melons_data, VARIETY == 4)
#Plotting boxplot comparing the average yields of the 4 melon varieties
ggplot()+
geom_boxplot(aes(x = 'Var 1', y = var1$YIELDM), colour = 'red')+
geom_boxplot(aes(x = 'Var 2', y = var2$YIELDM), colour = 'purple')+
geom_boxplot(aes(x = 'Var 3', y = var3$YIELDM), colour = 'blue')+
geom_boxplot(aes(x = 'Var 4', y = var4$YIELDM), colour = 'orange')+
labs(title = 'Comparisons of average yields of 4 melon varieties',
x = 'Melon Variety',
y = 'Crop yield (kg)')+
theme_minimal()
2C. Produce descriptive statistics that show you the means of the groups, and the confidence intervals.
#summary using dplyr functions
#Function to compute mean and 95% CI
mean_ci <- function(YIELDM){
n <- length(YIELDM)
mean_yield <- mean(YIELDM)
sd_yield <- sd(YIELDM)
error <- qt(0.975, df = n-1) * sd_yield / sqrt(n)
lower <- mean_yield - error
upper <- mean_yield + error
return(c(mean = mean_yield, lower = lower, upper = upper))
}
# function applied to each variety
ci_var1 <- mean_ci(var1$YIELDM)
ci_var2 <- mean_ci(var2$YIELDM)
ci_var3 <- mean_ci(var3$YIELDM)
ci_var4 <- mean_ci(var4$YIELDM)
# Combine results into a table
results <- rbind(ci_var1, ci_var2, ci_var3, ci_var4)
rownames(results) <- c("Variety 1", "Variety 2", "Variety 3", "Variety 4")
print(results)
## mean lower upper
## Variety 1 20.49000 15.56351 25.41649
## Variety 2 37.40333 33.25754 41.54913
## Variety 3 20.46250 12.88081 28.04419
## Variety 4 29.89667 27.55654 32.23680
2D. Produce an analysis of variance of YIELDM with respect to VARIETY. More than 2 groups are involved -> 1-way ANOVA
MelonANOVA <- lm(YIELDM ~ factor(VARIETY), data = Melons_data)
anova(MelonANOVA)
## Analysis of Variance Table
##
## Response: YIELDM
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(VARIETY) 3 1115.28 371.76 23.798 1.735e-06 ***
## Residuals 18 281.19 15.62
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2E. Show and describe the model’s resulting diagnostic plots.
plot(MelonANOVA)
Residuals vs fitted plot This plot checks for homoscedasticity and linearity. In this case, the residuals are distributed evenly to either side of the regression line, with no tapering in the spread at either end of the fitted values. This suggests that variance is rather constant and the assumptions of a linear model are not violated.
Q-Q residuals plot This plot checks for normality in the residuals. In this plot, the residuals stray from the diagonal line of theoretical normal distribution, suggesting that the normality assumption of a linear model are violated.
Scale-location plot This plot shows if residuals are spread equally along the ranges of predictors and is used to check the assumption of equal variance/ homoscedasticity.The red line strongly deviates from being horizontal, and the residuals are more widely spread towards the higher side of the fitted values. This unequal spread suggests unequal variance in the residuals, thus violating the homoscedasticity assumption pf lm.
4.Residuals vs Leverage plot This plot identifies the presence of strongly influential observations in a regression model. No points seem to fall outside of the Cook’s distance line.
2F. Report the F, dfs, and p-values, and interpret them.
The F-value represents the ratio of variance between groups to the variance within groups, where a larger inter- to intra- group variance ratio produces a higher F-value and suggests that there is a statistically significant difference in group means. The F-value in this case is 23.798. df = 4 - 1 = 3. df for a between-groups ANOVA is calculated as k - 1, where k is the level of factors (in this case, 4 groups). P-value 1.735e-06 < 0.001. This indicates that there is very strong evidence against the null hypothesis where ‘The mean yields of all four melon varieties are equal’, and hence it is rejected in favour of the alternative hypothesis. This suggests that different melon varieties produce significantly different yields.
A plant species is dioecious if each individual produces all male flowers or all female flowers. The Dioecious trees dataset contains data from 50 trees of one particular dioecious species, from a ten hectare area of mixed woodland.
For each individual, the SEX was recorded (coded as 1 for male and 2 for female), the diameter at breast height in millimetres (DBH), and the number of flowers on the tree at the time of measurement (FLOWERS).
Dtrees_data <- read_excel("C:/Users/LIEWY/Downloads/Assignment 1 data(1)(1).xls", sheet = "Dioecious trees")
View(Dtrees_data)
3A. Show graphically how the number of flowers differs between the sexes.
#Dtrees_data is a dataframe where '1' and '2' in SEX are numerical rather than catergorical
#'1' and '2' need to be converted into factors
Dtrees_data$SEX <- factor(Dtrees_data$SEX, levels = c('1','2'), labels = c('MALE', 'FEMALE'))
#plotting a boxplot
Dtrees_data %>%
ggplot(aes(x = SEX, y = FLOWERS, colour = SEX))+
geom_boxplot(outliers = TRUE)+
labs(x = 'Tree Sex',
y = 'Number of flowers produced',
title = 'Number of flowers produced by Male and Female trees')
3B. Test the hypothesis that male and female trees produce different
number of flowers. Y variable -> data that cannot be less than 0 and
hence normality cannot be assumed -> use a non-parametric test X
variable -> sex (2 groups) Test used should be a Mann-Whitney U
test
#Performing Mann-Whitney test
Trees_MW <- wilcox.test(FLOWERS~SEX, data = Dtrees_data, exact = FALSE)
print(Trees_MW)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FLOWERS by SEX
## W = 298, p-value = 0.9763
## alternative hypothesis: true location shift is not equal to 0
3C. Report the appropriate test statistic, dfs, and p-values, and interpret them.
Test statistic (W-statistic) is 298. df = N/A since Mann-Whitney U test was used. p-value 0.9 > 0.05. This indicates that there is no significant difference between the means of the number of flowers produced by male or female trees and the null hypothesis stating that ‘the true difference between the 2 groups is 0’ cannot be rejected.