Main goal of analysis
To test 3 different hypotheses about the data.
data <- read.csv("./Fish.csv")
head(data)
## Species Weight Length1 Length2 Length3 Height Width
## 1 Bream 242 23.2 25.4 30.0 11.5200 4.0200
## 2 Bream 290 24.0 26.3 31.2 12.4800 4.3056
## 3 Bream 340 23.9 26.5 31.1 12.3778 4.6961
## 4 Bream 363 26.3 29.0 33.5 12.7300 4.4555
## 5 Bream 430 26.5 29.0 34.0 12.4440 5.1340
## 6 Bream 450 26.8 29.7 34.7 13.6024 4.9274
unit of observation: 1 fish
sample size: 159
variables
– Species: species name of fish
– Weight: weight of fish in grams
– Length: there are three variables for the length of the fish in centimeters. The source describes them as Length1 = vertical, Length2 = diagonal and Length3 = cross length. Since the differences between them are unclear (I assume some include the fins and others do not) and the analysis does not need all three, two of them were removed. I keep the middle one, Length2.
Removing columns and renaming variables
mydata <- data[c(-3, -5)]
colnames(mydata) <- c("Species", "Weight", "Length", "Height", "Width")
mydata$Species <- factor (mydata$Species)
head(mydata)
## Species Weight Length Height Width
## 1 Bream 242 25.4 11.5200 4.0200
## 2 Bream 290 26.3 12.4800 4.3056
## 3 Bream 340 26.5 12.3778 4.6961
## 4 Bream 363 29.0 12.7300 4.4555
## 5 Bream 430 29.0 12.4440 5.1340
## 6 Bream 450 29.7 13.6024 4.9274
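Dropping columns by position depends on the column order in the CSV. A minimal sketch of the same step with columns selected by name is shown here; the `fish` data frame is only a stand-in built from the first three rows printed above, not the full data set.

```r
# Stand-in for the first three rows of Fish.csv shown above.
fish <- data.frame(
  Species = c("Bream", "Bream", "Bream"),
  Weight  = c(242, 290, 340),
  Length1 = c(23.2, 24.0, 23.9),
  Length2 = c(25.4, 26.3, 26.5),
  Length3 = c(30.0, 31.2, 31.1),
  Height  = c(11.5200, 12.4800, 12.3778),
  Width   = c(4.0200, 4.3056, 4.6961)
)

# Keep Length2 (the middle length) and drop the other two lengths by name.
fish_small <- fish[, c("Species", "Weight", "Length2", "Height", "Width")]
names(fish_small)[names(fish_small) == "Length2"] <- "Length"
fish_small$Species <- factor(fish_small$Species)

str(fish_small)
```

Selecting by name keeps the step correct even if the columns in the file are reordered.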
Replacing zero with NA and dropping NA
library(dplyr)   # %>%
library(tidyr)   # drop_na()
library(naniar)  # replace_with_na()
mydata <- mydata %>%
  replace_with_na(replace = list(Weight = 0,
                                 Length = 0,
                                 Height = 0,
                                 Width = 0))
mydata <- drop_na(mydata)
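The same cleaning step can be sketched in base R, which also makes it easy to see how many values are affected. The `toy` data frame below is hypothetical; only the column names match the data set.

```r
# Hypothetical toy data: one fish has a recorded weight of 0.
toy <- data.frame(
  Weight = c(242, 0, 340),
  Length = c(25.4, 26.3, 26.5),
  Height = c(11.52, 12.48, 12.38),
  Width  = c(4.02, 4.31, 4.70)
)

num_cols <- c("Weight", "Length", "Height", "Width")

# Count zeros per column before replacing them.
zeros_per_col <- colSums(toy[num_cols] == 0)
print(zeros_per_col)

# Treat zeros as missing values, then keep only complete rows.
vals <- toy[num_cols]
vals[vals == 0] <- NA
toy[num_cols] <- vals
toy_clean <- toy[complete.cases(toy), ]

nrow(toy_clean)
```

In the analysis itself this step reduces the sample from 159 to 158 fish.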
Testing the population arithmetic mean with the chosen parameter.
mydata1 <- subset(mydata [c(-2, -4, -5)],
Species == "Bream" )
head(mydata1)
## Species Length
## 1 Bream 25.4
## 2 Bream 26.3
## 3 Bream 26.5
## 4 Bream 29.0
## 5 Bream 29.0
## 6 Bream 29.7
Assumptions:
– Variable is numeric.
– Normality (the variable is normally distributed in the population).
– No outliers.
The first one is met; the 2nd and 3rd will be checked below:
library(ggplot2)
ggplot(mydata1, aes(x = Length)) +
  geom_histogram(binwidth = 2, colour = "black", fill = "lightblue") +
  ylab("Frequency") +
  xlab("Length of Bream")
Ho: variable (length of Bream) is normally distributed
H1: variable (length of Bream) is NOT normally distributed
shapiro.test(mydata1$Length)
##
## Shapiro-Wilk normality test
##
## data: mydata1$Length
## W = 0.97961, p-value = 0.7463
With p = 0.746 we cannot reject Ho at the 5 % significance level, so there is no evidence that the length of Bream deviates from a normal distribution.
The assumption of normality is met, so the parametric t-test can be used, provided there are no outliers.
ggplot(mydata1, aes(y = Length)) +
geom_boxplot (fill = "lightblue") +
ggtitle("Bream length") +
ylab("Length [cm]") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
There are no outliers (no points beyond the whiskers), so this assumption is also met and we can perform the parametric test.
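The boxplot check can also be sketched numerically. By default `geom_boxplot()` draws a point for any value more than 1.5 × IQR below the first quartile or above the third quartile. The vector `x` below holds only the six Bream lengths printed above and is a stand-in for the full column.

```r
# Stand-in: the six Bream lengths shown in head(mydata1).
x <- c(25.4, 26.3, 26.5, 29.0, 29.0, 29.7)

# Quartiles and interquartile range.
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences: values outside these bounds would be drawn as points.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
length(outliers)
```

Applied to `mydata1$Length`, a result of zero would agree with the boxplot showing no points.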
To do the t-test, we need a reference value to compare our data with. For the purpose of this test, we will say we are comparing the average length of Bream caught this year (the given data) with last year's average length of Bream, which was 35 cm.
Ho: Arithmetic mean of length of Bream caught this year = 35.
Average length of Bream caught this year is equal to average length of Bream caught last year.
H1: Arithmetic mean of length of Bream caught this year ≠ 35.
Average length of Bream caught this year is different from average length of Bream caught last year.
t.test(mydata1$Length,
mu = 35,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata1$Length
## t = -2.8604, df = 34, p-value = 0.007183
## alternative hypothesis: true mean is not equal to 35
## 95 percent confidence interval:
## 31.76478 34.45236
## sample estimates:
## mean of x
## 33.10857
The p-value is 0.007, therefore we can reject Ho: there is a statistically significant difference between the average length of Bream caught this year and the average length of Bream caught last year.
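To make clear what `t.test()` computes here, a minimal sketch of the one-sample t-statistic by hand follows. The helper `one_sample_t()` is a hypothetical name introduced for illustration, and `x` holds only the six Bream lengths printed earlier, so the numbers will differ from the output above.

```r
# Hypothetical helper: two-sided one-sample t-test computed by hand.
one_sample_t <- function(x, mu) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)            # standard error of the mean
  t_stat <- (mean(x) - mu) / se        # distance from mu in standard errors
  p      <- 2 * pt(-abs(t_stat), df = n - 1)
  c(t = t_stat, df = n - 1, p.value = p)
}

# Stand-in data: the six Bream lengths shown in head(mydata1).
x <- c(25.4, 26.3, 26.5, 29.0, 29.0, 29.7)

manual  <- one_sample_t(x, mu = 35)
builtin <- t.test(x, mu = 35, alternative = "two.sided")

print(manual)
print(unname(builtin$statistic))
```

Both approaches should give the same statistic and p-value; the hand computation only spells out the formula.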
effectsize::cohens_d(mydata1$Length, mu = 35)
## Cohen's d | 95% CI
## --------------------------
## -0.48 | [-0.83, -0.13]
##
## - Deviation from a difference of 35.
effectsize::interpret_cohens_d(0.48, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
Based on the sample data, we found that the average length of Bream caught this year (33.11 cm) is lower than the average length of Bream caught last year (p = 0.007, d = -0.48, small effect).
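For a one-sample test, Cohen's d is the mean difference divided by the sample standard deviation, which equals t / sqrt(n). The sketch below recomputes d from the t-test output reported above and then checks the identity on a stand-in vector (the six Bream lengths printed earlier).

```r
# Values reported in the t-test output above.
t_stat <- -2.8604
n      <- 35

# For a one-sample test: d = (mean - mu) / sd = t / sqrt(n).
d_from_t <- t_stat / sqrt(n)
round(d_from_t, 2)  # -0.48, in line with the cohens_d() output

# The same identity checked on a stand-in vector.
x  <- c(25.4, 26.3, 26.5, 29.0, 29.0, 29.7)
mu <- 35
d_direct <- (mean(x) - mu) / sd(x)
d_via_t  <- unname(t.test(x, mu = mu)$statistic) / sqrt(length(x))
all.equal(d_direct, d_via_t)
```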
Hypothesis about the difference between two population arithmetic means.
mydata2 <- subset(mydata[c(-2, -4, -5)],
Species == "Bream" | Species == "Perch" )
head(mydata2)
## Species Length
## 1 Bream 25.4
## 2 Bream 26.3
## 3 Bream 26.5
## 4 Bream 29.0
## 5 Bream 29.0
## 6 Bream 29.7
Assumptions:
– Variable is numeric.
– The distribution of the variable is normal in both populations.
– Variable has the same variance in both populations (if not: apply Welch correction).
The first one is met; the 2nd and 3rd will be checked below:
ggplot(mydata2, aes(x=Length)) +
geom_histogram(binwidth = 5, colour="black", fill="lightblue") +
facet_wrap(~Species, ncol = 1) +
ylab("Frequency")
Ho: Distribution is normal.
H1: Distribution is not normal.
library(rstatix)  # shapiro_test()
mydata2 %>%
  group_by(Species) %>%
  shapiro_test(Length)
## # A tibble: 2 × 4
## Species variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Bream Length 0.980 0.746
## 2 Perch Length 0.938 0.00624
Bream: Cannot reject Ho (p = 0.746).
Perch: Reject Ho at p = 0.006. Length of Perch is not normally distributed.
The assumption of normal distribution of the variable in both populations is violated, so a non-parametric test (the Wilcoxon rank sum test) is used instead.
Ho: Distribution locations are the same.
H1: Distribution locations are different.
wilcox.test(mydata2$Length ~ mydata2$Species,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata2$Length by mydata2$Species
## W = 1348, p-value = 0.002672
## alternative hypothesis: true location shift is not equal to 0
We can reject Ho at p = 0.003, meaning the distribution locations of length for the two species are different.
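The W statistic reported by `wilcox.test()` can be sketched by hand: rank all observations together, sum the ranks of the first group, and subtract the smallest possible rank sum for that group. The `perch` values below are hypothetical, and `bream` uses a few of the lengths printed earlier, so this illustrates the computation rather than reproducing W = 1348.

```r
# Stand-in data: bream from the rows above, perch values are hypothetical.
bream <- c(25.4, 26.3, 26.5, 29.0, 29.7)
perch <- c(18.0, 21.5, 24.0, 27.2, 30.1)

# Rank the pooled sample, then sum the ranks belonging to the first group.
r  <- rank(c(bream, perch))
n1 <- length(bream)
W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

W_builtin <- wilcox.test(bream, perch,
                         paired = FALSE,
                         correct = FALSE,
                         exact = FALSE)$statistic

c(manual = W_manual, builtin = unname(W_builtin))
```

W counts the pairs (one fish from each group) in which the first group's value is larger, so it ranges from 0 to n1 * n2.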
effectsize::effectsize(wilcox.test(mydata2$Length ~ mydata2$Species,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.38 | [0.15, 0.56]
effectsize::interpret_rank_biserial(0.38)
## [1] "large"
## (Rules: funder2019)
Based on the sample data, we found that the lengths of Bream and Perch differ (p = 0.003). The difference in distribution location is large (r = 0.38).
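The rank-biserial correlation can be recovered from the reported W and the group sizes, assuming the usual definition r = 2W / (n1 * n2) - 1. The sketch below uses W = 1348 from the test output and the group sizes of 35 Bream and 56 Perch.

```r
# Values taken from the outputs in this report.
W  <- 1348  # Wilcoxon rank sum statistic
n1 <- 35    # number of Bream
n2 <- 56    # number of Perch

# Share of Bream-Perch pairs won by the first group, rescaled to [-1, 1].
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)  # 0.38
```

A value of 0 would mean neither species tends to be longer; values near 1 or -1 mean one species is longer in almost every pair.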
Hypothesis about the population proportion.
mydata3 <- mydata[c(-2, -3, -4, -5)]
head(mydata3)
## Species
## 1 Bream
## 2 Bream
## 3 Bream
## 4 Bream
## 5 Bream
## 6 Bream
We take the data set with all available species of fish and test whether Perch represent more than 30 % of the population.
Assumption: n * π > 5 and n * (1 - π) > 5
plyr::count(mydata3, "Species")
## Species freq
## 1 Bream 35
## 2 Parkki 11
## 3 Perch 56
## 4 Pike 17
## 5 Roach 19
## 6 Smelt 14
## 7 Whitefish 6
nrow(mydata3)
## [1] 158
158 * 0.3 = 47.4 > 5: this assumption is met. 158 * (1 - 0.3) = 110.6 > 5: this assumption is met.
Ho: π = 0.3
H1: π > 0.3
prop.test(x = 56,
n = 158,
p = 0.3,
correct = FALSE,
alternative = "greater")
##
## 1-sample proportions test without continuity correction
##
## data: 56 out of 158, null probability 0.3
## X-squared = 2.2291, df = 1, p-value = 0.06772
## alternative hypothesis: true p is greater than 0.3
## 95 percent confidence interval:
## 0.2947674 1.0000000
## sample estimates:
## p
## 0.3544304
The p-value is 0.068, which is above the 5 % significance level, therefore we cannot reject Ho. Based on the sample data, we cannot conclude that Perch represent more than 30 % of the population (sample proportion 35.4 %, p = 0.068).
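The test above can be reproduced by hand as a one-sided z-test for a proportion, which shows where the p-value of 0.068 comes from. The sketch uses the counts from the output (56 Perch out of 158 fish); without continuity correction, the X-squared reported by `prop.test()` is the square of z.

```r
x  <- 56    # number of Perch
n  <- 158   # sample size after cleaning
p0 <- 0.3   # proportion under Ho

p_hat <- x / n

# Standard error under Ho, then the z-statistic.
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# One-sided p-value for H1: pi > 0.3.
p_value <- pnorm(z, lower.tail = FALSE)

builtin <- prop.test(x = x, n = n, p = p0,
                     correct = FALSE,
                     alternative = "greater")

c(z = z, z_squared = z^2, p_value = p_value)
c(X_squared = unname(builtin$statistic), p_value = builtin$p.value)
```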