Data

Import the data

data <- read.csv("./Fish.csv")

head(data)

##   Species Weight Length1 Length2 Length3  Height  Width
## 1   Bream    242    23.2    25.4    30.0 11.5200 4.0200
## 2   Bream    290    24.0    26.3    31.2 12.4800 4.3056
## 3   Bream    340    23.9    26.5    31.1 12.3778 4.6961
## 4   Bream    363    26.3    29.0    33.5 12.7300 4.4555
## 5   Bream    430    26.5    29.0    34.0 12.4440 5.1340
## 6   Bream    450    26.8    29.7    34.7 13.6024 4.9274

Explanations

unit of observation: 1 fish
sample size: 159
variables

– Species: species name of fish

– Weight: weight of fish in grams

– Length: three variables give the length of the fish in centimeters. The source describes them as Length1 = vertical, Length2 = diagonal and Length3 = cross length. Since the differences between them are unclear (I assume some include the fins and others do not) and all three are not essential to the analysis, two of them are removed. I will keep the middle one (Length2).

Cleaning the data

Removing columns and renaming variables

mydata <- data[c(-3, -5)]

colnames(mydata) <- c("Species", "Weight", "Length", "Height", "Width")

mydata$Species <- factor(mydata$Species)

head(mydata)

##   Species Weight Length  Height  Width
## 1   Bream    242   25.4 11.5200 4.0200
## 2   Bream    290   26.3 12.4800 4.3056
## 3   Bream    340   26.5 12.3778 4.6961
## 4   Bream    363   29.0 12.7300 4.4555
## 5   Bream    430   29.0 12.4440 5.1340
## 6   Bream    450   29.7 13.6024 4.9274

Replacing zero with NA and dropping NA

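library(dplyr)   # provides the %>% pipe
library(naniar)  # provides replace_with_na()
library(tidyr)   # provides drop_na()

# Treat a zero measurement as missing
mydata <- mydata %>%
  replace_with_na(replace = list(Weight = 0,
                                 Length = 0,
                                 Height = 0,
                                 Width = 0))

# Remove every row that contains an NA
mydata <- drop_na(mydata)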
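As a cross-check, the same cleaning step can be sketched in base R, without naniar or tidyr. The data frame toy and its values are made up for illustration and are not taken from Fish.csv.

```r
# Toy data frame (hypothetical values, not from Fish.csv)
toy <- data.frame(Species = c("Bream", "Roach", "Perch"),
                  Weight  = c(242, 0, 110),
                  Length  = c(25.4, 20.5, 22.0))

# Columns in which a zero is treated as a missing measurement
num_cols <- c("Weight", "Length")

# Replace zeros with NA, column by column
toy[num_cols] <- lapply(toy[num_cols], function(x) replace(x, x == 0, NA))

# Keep only the rows without any NA
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

In this toy example only the row with zero weight is dropped, which mirrors how the sample shrinks from 159 to 158 fish.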

Source:

https://www.kaggle.com/datasets/aungpyaeap/fish-market

Main goal of analysis

To test 3 different hypotheses about the data.

1. statistical hypothesis test: t-test or Wilcoxon Signed Rank Test

Testing the population arithmetic mean with the chosen parameter.

1.1. Adjust the data

mydata1 <- subset(mydata[c(-2, -4, -5)],
                  Species == "Bream")

head(mydata1)

##   Species Length
## 1   Bream   25.4
## 2   Bream   26.3
## 3   Bream   26.5
## 4   Bream   29.0
## 5   Bream   29.0
## 6   Bream   29.7

1.2. Assumptions for parametrical test:

Variable is numeric.
Normality (variable on the population is normally distributed)
No outliers.

The first one is met; the 2nd and 3rd will be checked below:

1.2.1. Check normality (Histogram and Shapiro)

ggplot(mydata1, aes(x=Length)) +
  geom_histogram(binwidth = 2, colour = "black", fill = "lightblue")+
  ylab("Frequency")+
  xlab("Length of Bream")

Ho: variable (length of Bream) is normally distributed

H1: variable (length of Bream) is NOT normally distributed

shapiro.test(mydata1$Length)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$Length
## W = 0.97961, p-value = 0.7463

p = 0.746. With a p-value this high we cannot reject Ho, so we cannot reject that the length of Bream is normally distributed.

Assumption of normality is met, therefore we can do the t-test (parametric).

1.2.2. Check if there are outliers

ggplot(mydata1, aes(y = Length)) +
  geom_boxplot (fill = "lightblue") +
  ggtitle("Bream length") +
  ylab("Length [cm]") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

There are no outliers (no points), assumption is met, we can perform the parametrical test.
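The boxplot flags points lying more than 1.5 times the interquartile range outside the box. A minimal sketch of this rule in base R with boxplot.stats(), which may compute the quartiles slightly differently from ggplot2. It uses only the six Bream lengths printed by head(mydata1), not the full sample of 35, so it illustrates the rule rather than repeating the check.

```r
# First six Bream lengths as printed by head(mydata1); not the full sample
len6 <- c(25.4, 26.3, 26.5, 29.0, 29.0, 29.7)

# boxplot.stats() returns in $out the points lying more than
# coef * IQR beyond the hinges
outliers <- boxplot.stats(len6, coef = 1.5)$out
length(outliers)
```

On the real data the same call would be boxplot.stats(mydata1$Length)$out, and an empty result corresponds to a boxplot without points.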

1.3. T-TEST

To be able to do the t-test, we need a reference value to compare our data with. For the purpose of this test, we will say we are comparing the average length of Bream caught this year (the given data) with last year's average length of Bream, which was 35 cm.

Ho: Arithmetic mean of length of Bream caught this year = 35.

Average length of Bream caught this year is equal to average length of Bream caught last year.

H1: Arithmetic mean of length of Bream caught this year ≠ 35.

Average length of Bream caught this year is different to average length of Bream caught last year.

t.test(mydata1$Length,
       mu = 35,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata1$Length
## t = -2.8604, df = 34, p-value = 0.007183
## alternative hypothesis: true mean is not equal to 35
## 95 percent confidence interval:
##  31.76478 34.45236
## sample estimates:
## mean of x 
##  33.10857

The p-value is 0.007, therefore we can reject Ho: there is a statistically significant difference between the average length of Bream caught this year and the average length of Bream caught last year.
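The reported p-value and confidence interval can be reproduced from the printed test statistic alone. A minimal sketch, assuming the rounded values from the output above (so the last digits may differ slightly); the variable names are my own.

```r
# Values copied from the t.test output (rounded as printed)
t_stat <- -2.8604   # test statistic
df     <- 34        # degrees of freedom (n - 1)
xbar   <- 33.10857  # sample mean
mu0    <- 35        # mean under Ho

# Two-sided p-value: probability of a t at least this extreme under Ho
p_value <- 2 * pt(-abs(t_stat), df)

# Standard error recovered from t = (xbar - mu0) / se
se <- (xbar - mu0) / t_stat

# 95 % confidence interval for the mean
ci <- xbar + c(-1, 1) * qt(0.975, df) * se

round(p_value, 4)
round(ci, 2)
```

The interval does not contain 35, which agrees with rejecting Ho at the 5 % level.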

1.4. EFFECT SIZE

effectsize::cohens_d(mydata1$Length, mu = 35)

## Cohen's d |         95% CI
## --------------------------
## -0.48     | [-0.83, -0.13]
## 
## - Deviation from a difference of 35.

interpret_cohens_d(0.48, rules = "sawilowsky2009")

## [1] "small"
## (Rules: sawilowsky2009)

Based on the sample data, we found that the average length of Bream caught this year (33.11 cm) is lower than the average length of Bream caught last year (35 cm); p = 0.007, d = -0.48, a small effect.
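For a one-sample test, Cohen's d is the distance between the sample mean and the hypothesized mean in units of the sample standard deviation, which equals t divided by the square root of n. A minimal sketch using the rounded values printed by t.test above (variable names are my own):

```r
t_stat <- -2.8604  # test statistic from the t-test output
n      <- 35       # sample size (df + 1)

# d = (mean - mu) / sd = t / sqrt(n)
d <- t_stat / sqrt(n)
round(d, 2)
```

The sign only shows the direction of the difference (this year's mean is below 35); the size of the effect is judged on the absolute value.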

2. statistical hypothesis test: independent t-test or Wilcoxon Rank Sum Test

Hypothesis about the difference between two population arithmetic means.

2.1. Adjust the data

mydata2 <- subset(mydata[c(-2, -4, -5)],
                  Species == "Bream" | Species == "Perch" )

head(mydata2)

##   Species Length
## 1   Bream   25.4
## 2   Bream   26.3
## 3   Bream   26.5
## 4   Bream   29.0
## 5   Bream   29.0
## 6   Bream   29.7

2.2. Assumptions for parametrical test

Variable is numeric.
The distribution of the variable is normal in both populations.
Variable has the same variance in both populations (if not: apply Welch correction)

The first one is met and the 2nd is checked below; the 3rd only needs to be checked if the parametric test is used:

2.2.1. Check normality (Histogram and Shapiro)

ggplot(mydata2, aes(x=Length)) +
  geom_histogram(binwidth = 5, colour="black", fill="lightblue") +
  facet_wrap(~Species, ncol = 1) +
  ylab("Frequency")

Ho: Distribution is normal.

H1: Distribution is not normal.

mydata2 %>%
  group_by(Species)%>%
  shapiro_test(Length)

## # A tibble: 2 × 4
##   Species variable statistic       p
##   <fct>   <chr>        <dbl>   <dbl>
## 1 Bream   Length       0.980 0.746  
## 2 Perch   Length       0.938 0.00624

Bream: Cannot reject Ho (p=0.746).

Perch: Reject Ho at p=0.006. Length of Perch is not normally distributed.

The assumption of a normal distribution of the variable in both populations is violated. The non-parametric test will be done.
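The choice between the two tests can be written as a small decision rule. A minimal sketch, assuming a significance level of 0.05 (not stated explicitly in this report) and using the Shapiro-Wilk p-values from the table above:

```r
# Shapiro-Wilk p-values copied from the output above
p_shapiro <- c(Bream = 0.746, Perch = 0.00624)
alpha <- 0.05  # assumed significance level

# Normality is treated as plausible only if no group rejects Ho
normal_in_all <- all(p_shapiro > alpha)

chosen_test <- if (normal_in_all) {
  "independent t-test"
} else {
  "Wilcoxon rank sum test"
}
chosen_test
```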

2.3. WILCOXON RANK SUM TEST

Ho: Distribution locations are the same.

H1: Distribution locations are different.

wilcox.test(mydata2$Length ~ mydata2$Species,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$Length by mydata2$Species
## W = 1348, p-value = 0.002672
## alternative hypothesis: true location shift is not equal to 0

We can reject Ho at p=0.003, meaning the distributions of length for the two species differ in location.

2.4. EFFECT SIZE

effectsize(wilcox.test(mydata2$Length ~ mydata2$Species,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.38              | [0.15, 0.56]

interpret_rank_biserial(0.38)

## [1] "large"
## (Rules: funder2019)

Based on the sample data, we found that length of Bream and Perch differ (p=0.003). The difference in distribution location is large (r=0.38).
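The rank-biserial correlation can be recovered from the W statistic and the two group sizes. A minimal sketch, assuming the group sizes of 35 Bream and 56 Perch from the species counts in section 3.2 and the usual relation r = 2W / (n1 * n2) - 1; the variable names are my own.

```r
W  <- 1348  # test statistic from wilcox.test above
n1 <- 35    # number of Bream (first group)
n2 <- 56    # number of Perch (second group)

# Share of (Bream, Perch) pairs in which the Bream is longer,
# rescaled from [0, 1] to [-1, 1]
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
```

A positive value means that in most Bream-Perch pairs the Bream is the longer fish.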

3. statistical hypothesis test: the population proportion

3.1. Adjust the data

mydata3 <- mydata[c(-2, -3, -4, -5)]

head(mydata3)

##   Species
## 1   Bream
## 2   Bream
## 3   Bream
## 4   Bream
## 5   Bream
## 6   Bream

We take the data set with all available species of fish and test whether Perch make up more than 30 % of the population of fish.

3.2. Assumptions for parametrical test:

n * π > 5 and n(1 - π) > 5

plyr::count(mydata3, "Species")

##     Species freq
## 1     Bream   35
## 2    Parkki   11
## 3     Perch   56
## 4      Pike   17
## 5     Roach   19
## 6     Smelt   14
## 7 Whitefish    6

nrow(mydata3)

## [1] 158

158 * 0.3 = 47.4 > 5, so the first condition is met. 158 * (1 - 0.3) = 110.6 > 5, so the second condition is met as well.

3.3. PROP-TEST

Ho: π = 0.3

H1: π > 0.3

prop.test(x = 56,
          n = 158,
          p = 0.3,
          correct = FALSE,
          alternative = "greater")

## 
##  1-sample proportions test without continuity correction
## 
## data:  56 out of 158, null probability 0.3
## X-squared = 2.2291, df = 1, p-value = 0.06772
## alternative hypothesis: true p is greater than 0.3
## 95 percent confidence interval:
##  0.2947674 1.0000000
## sample estimates:
##         p 
## 0.3544304

The p-value is 0.068, which is above the 5 % significance level, therefore we cannot reject Ho. Although Perch make up 35.4 % of the sample, we cannot conclude that Perch represent more than 30 % of the population.
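The output of prop.test can be checked by hand with the normal approximation, since without continuity correction X-squared is the square of the z statistic. A minimal sketch using the counts from above (variable names are my own):

```r
x  <- 56    # number of Perch in the sample
n  <- 158   # sample size after cleaning
p0 <- 0.3   # proportion under Ho

# Sample proportion and z statistic under Ho
p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Without continuity correction, X-squared equals z squared
x_squared <- z^2

# One-sided p-value for H1: proportion greater than p0
p_value <- pnorm(z, lower.tail = FALSE)

round(c(p_hat = p_hat, X_squared = x_squared, p_value = p_value), 4)
```

The one-sided p-value lies above 0.05, so Ho is not rejected at the 5 % level.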

HW R 2_Despot

Dragana Despot

2023-01-12