RESEARCH QUESTION: The original possum data frame consists of eight morphometric measurements on each of 104 mountain brushtail possums, trapped at seven sites from southern Victoria to central Queensland.
Source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum?search=possum
mydata <- read.table("./possum.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[ , -c(1,2,7,10,11,12,13,14)]
colnames(mydata) <- c("Population", "Sex", "Age", "HeadLength", "TotalLength", "TailLength")
mydata$Gender <- factor(mydata$Sex,
levels = c("m","f"),
labels = c("Male", "Female"))
mydata$PopulationFactor <- factor(mydata$Population,
levels = c("Vic","other"),
labels = c("Victoria", "Other"))
mydata1 <- tidyr::drop_na(mydata) # drop_na() comes from the tidyr package; removes rows with missing values
mydata1 <- mydata1[ , -c(1,2)]
head(mydata1)
##   Age HeadLength TotalLength TailLength Gender PopulationFactor
## 1   8       94.1        89.0       36.0   Male         Victoria
## 2   6       92.5        91.5       36.5 Female         Victoria
## 3   6       94.0        95.5       39.0 Female         Victoria
## 4   6       93.2        92.0       38.0 Female         Victoria
## 5   2       91.5        85.5       36.0 Female         Victoria
## 6   1       93.1        90.5       35.5 Female         Victoria
In the data manipulation step I kept only the columns relevant to the analysis, renamed them and relabelled the factor levels. I also dropped the two rows that contained missing values.
The unit of observation is a possum caught in Victoria, New South Wales or Queensland. After filtering the data I was left with 102 units of observation. I analyzed the 6 variables described below. The data contain both numerical and categorical variables.
No outliers are present, and head length, tail length and total length are naturally correlated with one another and with the age of the possum.
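The filtering step above selects columns by position, which is easy to get wrong if the file layout changes. A minimal sketch of the same step using column names follows. The raw column names (`Pop`, `sex`, `age`, `hdlngth`, `totlngth`, `taill`) are an assumption based on the public possum data, and the toy values are invented for illustration only.

```r
# Toy stand-in for possum.csv: column names are assumed, values are made up.
toy <- data.frame(
  case = 1:3, site = c(1, 1, 7),
  Pop = c("Vic", "Vic", "other"), sex = c("m", "f", "f"),
  age = c(8, NA, 3), hdlngth = c(94.1, 92.5, 90.0),
  skullw = c(60.4, 57.6, 55.0), totlngth = c(89.0, 91.5, 85.0),
  taill = c(36.0, 36.5, 37.0)
)

# Keep the six columns of interest by name instead of dropping by index.
keep <- c("Pop", "sex", "age", "hdlngth", "totlngth", "taill")
toy_small <- toy[, keep]
names(toy_small) <- c("Population", "Sex", "Age",
                      "HeadLength", "TotalLength", "TailLength")

# Base-R equivalent of dropping rows with missing values.
toy_complete <- toy_small[complete.cases(toy_small), ]
nrow(toy_complete)
```

Selecting by name fails loudly if a column is missing, whereas a negative index silently drops whatever happens to be in that position.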
Description of the variables:
Age: The age of captured possum in years.
HeadLength: Length of the head, in mm.
TotalLength: Total length of the animal, in cm.
TailLength: Length of the tail, in cm.
Gender: Sex of the possum, either Male or Female.
PopulationFactor: Population, either Victoria or Other (New South Wales or Queensland), depending on where the possum was trapped.
As stated in the research question, the goal is to find out whether the total length, tail length and head length are greater in male than in female possums.
summary(mydata1)
##       Age          HeadLength      TotalLength      TailLength       Gender   PopulationFactor
##  Min.   :1.000   Min.   : 82.50   Min.   :75.00   Min.   :32.00   Male  :59   Victoria:44
##  1st Qu.:2.250   1st Qu.: 90.70   1st Qu.:84.12   1st Qu.:36.00   Female:43   Other   :58
##  Median :3.000   Median : 92.85   Median :88.00   Median :37.00
##  Mean   :3.833   Mean   : 92.69   Mean   :87.23   Mean   :37.04
##  3rd Qu.:5.000   3rd Qu.: 94.78   3rd Qu.:90.00   3rd Qu.:38.00
##  Max.   :9.000   Max.   :103.10   Max.   :96.50   Max.   :43.00
The median is a measure of central tendency that represents the midpoint of a dataset. It is the value that separates the higher half from the lower half of a dataset. To find the median of a dataset, you need to arrange the values in numerical order and then find the middle value. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the average of the two middle values.
Similarly the mean is a measure of central tendency that represents the average of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the number of values.
After computing the descriptive statistics we observe the information shown in the summary output. Age ranged from 1 to 9 years, with a median of 3 years and a mean of 3.833 years. This means that the sum of the ages of all trapped possums divided by the number of possums is 3.833, and that half of the possums were 3 years old or younger and half were 3 years old or older. Head length ranged from 82.50 mm to 103.10 mm, and tail length from 32 cm to 43 cm. The means were 92.69 mm and 37.04 cm, and the medians were 92.85 mm and 37.00 cm, respectively.
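The definitions of the median and the mean given earlier can be checked on a small vector. The sketch below uses hypothetical ages, not values from the possum data, and covers both the odd and the even case for the median.

```r
# Hypothetical ages in years (not taken from the possum data).
ages <- c(1, 2, 3, 5, 9)

# Odd number of values: the median is the middle value of the sorted vector.
median_odd <- median(ages)

# Even number of values: the median is the average of the two middle values.
median_even <- median(c(ages, 10))

# The mean is the sum of the values divided by their count.
mean_manual <- sum(ages) / length(ages)
```

Here `median_odd` is 3, `median_even` is 4 (the average of 3 and 5), and `mean_manual` is 4, the same value that `mean(ages)` returns.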
I set the following hypotheses to be tested:
H0: Male and female possums have on average the same total length.
H1: Male and female possums have on average different total lengths.
library(ggplot2)
Male <- ggplot(mydata1[mydata1$Gender == "Male", ], aes(x= TotalLength)) +
theme_minimal() +
geom_histogram(binwidth = 3, fill = "blue", color = "black") +
ggtitle("Lengths of males")
Female <- ggplot(mydata1[mydata1$Gender == "Female", ], aes(x= TotalLength)) +
theme_minimal() +
geom_histogram(binwidth = 3, fill = "pink",color = "black") +
ggtitle("Lengths of females")
library(ggpubr)
ggarrange(Male, Female,
ncol = 2)
The total length of males does not look normally distributed, as the
histogram has two peaks. The total length of female possums looks
approximately normally distributed. To be sure, I am going to check the
normality of both distributions with the Shapiro-Wilk test.
H0: variable is normally distributed.
H1: variable is not normally distributed.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata1 %>%
rstatix::group_by(Gender) %>%
shapiro_test(TotalLength)
## # A tibble: 2 × 4
## Gender variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male TotalLength 0.979 0.384
## 2 Female TotalLength 0.972 0.371
As both p-values are larger than 0.05 (5%), we cannot reject the null hypothesis of normality for either group. The next assumption to test is whether the TotalLength variable has the same variance for male and female possums.
H0: TotalLength has the same variance in both groups.
H1: TotalLength has a different variance in the two groups.
aggregate(mydata1$TotalLength, list(mydata1$Gender),sd)
## Group.1 x
## 1 Male 4.173828
## 2 Female 4.182241
var.test(TotalLength ~ Gender , data = mydata1)
##
## F test to compare two variances
##
## data: TotalLength by Gender
## F = 0.99598, num df = 58, denom df = 42, p-value = 0.9766
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5568723 1.7354568
## sample estimates:
## ratio of variances
## 0.9959809
We cannot reject the null hypothesis that the variance of total length is the same for male and female possums (p = 0.9766).
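The F statistic reported by `var.test()` is simply the ratio of the two sample variances. As a rough check, it can be recomputed from the standard deviations and group sizes printed above. This is a sketch from rounded summary values, not a replacement for the test on the raw data.

```r
# Summary values copied from the output above.
sd_male <- 4.173828
sd_female <- 4.182241
n_male <- 59
n_female <- 43

# F statistic: ratio of the sample variances (Male over Female).
f_stat <- sd_male^2 / sd_female^2

# Two-sided p-value from the F distribution with (n1 - 1, n2 - 1) df.
p_lower <- pf(f_stat, df1 = n_male - 1, df2 = n_female - 1)
p_value <- 2 * min(p_lower, 1 - p_lower)

round(c(F = f_stat, p = p_value), 4)
```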
As the variable is numeric, its distribution is normal in both populations, the data come from two independent populations and TotalLength has the same variance in both populations, I will perform the independent samples t-test.
t.test(mydata1$TotalLength ~ mydata1$Gender,
paired = FALSE,
var.equal = TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: mydata1$TotalLength by mydata1$Gender
## t = -1.4025, df = 100, p-value = 0.1639
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -2.8365688 0.4870221
## sample estimates:
## mean in group Male mean in group Female
## 86.73220 87.90698
Since the p-value (0.1639) is greater than 0.05, and equivalently the 95 percent confidence interval for the difference in means (-2.84 to 0.49) contains 0, we cannot reject the null hypothesis. The sample does not provide evidence that the average total length differs between male and female possums.
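To show where the t statistic and the confidence interval come from, the pooled-variance calculation can be reproduced from the group means, standard deviations and sizes printed above. This is a sketch based on rounded summary values, so small differences in the last digits are possible.

```r
# Summary values copied from the outputs above.
mean_male <- 86.73220
mean_female <- 87.90698
sd_male <- 4.173828
sd_female <- 4.182241
n_male <- 59
n_female <- 43

# Pooled variance and standard error of the difference in means.
dof <- n_male + n_female - 2
pooled_var <- ((n_male - 1) * sd_male^2 + (n_female - 1) * sd_female^2) / dof
se_diff <- sqrt(pooled_var * (1 / n_male + 1 / n_female))

# t statistic, two-sided p-value and 95% confidence interval.
diff_means <- mean_male - mean_female
t_stat <- diff_means / se_diff
p_value <- 2 * pt(-abs(t_stat), df = dof)
ci <- diff_means + c(-1, 1) * qt(0.975, df = dof) * se_diff

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

The decision rests on the interval for the difference in means containing 0, not on where the t statistic falls relative to that interval.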
Next, I wanted to test whether the gender of the caught animals was associated with the location where they were trapped. For this I performed the chi-square test of independence. The hypotheses were as follows:
H0: There is no association between the gender and the area they were trapped
H1: There is an association between the gender and the area they were trapped
obmocje_spol <- chisq.test(mydata1$Gender, mydata1$PopulationFactor, correct = TRUE)
obmocje_spol
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata1$Gender and mydata1$PopulationFactor
## X-squared = 4.0177, df = 1, p-value = 0.04502
As the p-value (0.045) is smaller than 5%, we reject the null hypothesis and conclude that there is an association between gender and the area where the possums were trapped.
Here are some additional statistics that we also computed in the lectures.
addmargins(obmocje_spol$observed)
##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other Sum
##         Male         20    39  59
##         Female       24    19  43
##         Sum          44    58 102
As seen from above results, only 20 males were caught in Victoria and 39 were caught in Queensland or New South Wales. On the other hand, more females were caught in Victoria with 24, compared to 19 trapped in other two areas.
round(obmocje_spol$expected, 2)
##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other
##         Male      25.45 33.55
##         Female    18.55 24.45
Here we see the expected number of animals in each cell if gender and the area where they were trapped were independent.
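The expected counts and the test statistic can be recomputed by hand from the observed table, which makes the role of the Yates continuity correction explicit. The sketch below re-enters the observed counts as a matrix; the object names are illustrative.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(20, 24, 39, 19), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Population = c("Victoria", "Other")))

# Expected count per cell: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' correction (subtract 0.5 from |O - E|).
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

round(expected, 2)
round(c(X2 = chi2_yates, p = p_value), 4)
```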
round(obmocje_spol$residuals, 2)
##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other
##         Male      -1.08  0.94
##         Female     1.27 -1.10
We see that none of the residuals is statistically significant, neither at α = 0.05 nor at α = 0.01, because all absolute values are below the critical value of 1.96.
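The residuals shown are Pearson residuals, (observed - expected) / sqrt(expected), and judging them against normal critical values is an approximation. A minimal sketch of that comparison, with the observed counts re-entered by hand:

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(20, 24, 39, 19), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Population = c("Victoria", "Other")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual for each cell.
pearson_res <- (observed - expected) / sqrt(expected)

# Two-sided critical values of the standard normal distribution.
critical_05 <- qnorm(0.975)  # about 1.96
critical_01 <- qnorm(0.995)  # about 2.58

round(pearson_res, 2)
any(abs(pearson_res) > critical_05)
```

No single cell stands out, even though the overall test is significant at the 5% level.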
addmargins(round(prop.table(obmocje_spol$observed),3))
##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other   Sum
##         Male      0.196 0.382 0.578
##         Female    0.235 0.186 0.421
##         Sum       0.431 0.568 0.999
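We can observe the proportions of trapped animals by gender and by the area where they were trapped. The total is 0.999 rather than 1 because the proportions were rounded before the margins were added.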
addmargins(round(prop.table(obmocje_spol$observed,1),3),2)
##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other   Sum
##         Male      0.339 0.661 1.000
##         Female    0.558 0.442 1.000
33.9% of male and 55.8% of female possums were trapped in Victoria, while 66.1% of male and 44.2% of female possums were caught in Queensland or New South Wales.
addmargins(round(prop.table(obmocje_spol$observed,2),3),1)
##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other
##         Male      0.455 0.672
##         Female    0.545 0.328
##         Sum       1.000 1.000
This table is similar to the previous one, but the proportions are computed within each area: 45.5% of the animals trapped in Victoria were male and the rest were female. In New South Wales and Queensland, 67.2% of all caught animals were male and the rest were female.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata1$Gender, mydata1$PopulationFactor)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.20 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.20)
## [1] "medium"
## (Rules: funder2019)
The effect size was 0.20, which is interpreted as a medium effect. In other words, the association between the area of trapping and gender is of medium strength.
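For reference, Cramér's V can be computed directly from the uncorrected chi-square statistic. The sketch below shows the plain formula and a bias-corrected variant. That the correction shown (the Bergsma adjustment, as I understand it) is exactly what `effectsize::cramers_v()` applies is an assumption, although the result rounds to the same 0.20.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(20, 24, 39, 19), nrow = 2)
n <- sum(observed)
n_row <- nrow(observed)
n_col <- ncol(observed)

# Pearson chi-square without continuity correction.
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Plain Cramér's V.
v_raw <- sqrt(chi2 / (n * (min(n_row, n_col) - 1)))

# Bias-corrected version (assumed to mirror the "adj." value in the output).
phi2_adj <- max(0, chi2 / n - (n_row - 1) * (n_col - 1) / (n - 1))
row_adj <- n_row - (n_row - 1)^2 / (n - 1)
col_adj <- n_col - (n_col - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(row_adj - 1, col_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 3)
```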
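The last question is whether more male than female possums were trapped, which I test with a one-sample test of population proportion. First we have to check the assumption for this test:
p <- 0.5 # proportion under the null hypothesis
n <- 102 # number of possums after removing missing values
assumption <- p*n > 5 & (1-p)*n > 5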
assumption
## [1] TRUE
As we can see, the assumption is met and we can perform the desired test.
H0: The proportion of male possums among those trapped is 0.5.
H1: The proportion of male possums among those trapped is greater than 0.5.
prop.test(x=59,
n=102,
p=0.5,
correct = FALSE,
alternative = "greater")
##
## 1-sample proportions test without continuity correction
##
## data: 59 out of 102, null probability 0.5
## X-squared = 2.5098, df = 1, p-value = 0.05657
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.4970004 1.0000000
## sample estimates:
## p
## 0.5784314
Since the p-value (0.057) is greater than 0.05, we cannot reject the null hypothesis. The data do not provide sufficient evidence that the proportion of males among trapped possums exceeds 0.5.
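Without continuity correction, the statistic reported by `prop.test()` is the square of the usual z statistic for a proportion, so the result can be reproduced by hand. The sketch below also computes an exact binomial p-value with `binom.test()` as a cross-check. That extra test is an addition for illustration and was not part of the original analysis.

```r
# Counts taken from the analysis above.
x <- 59    # male possums
n <- 102   # all possums
p0 <- 0.5  # proportion under the null hypothesis

# z statistic and one-sided p-value (alternative: p > 0.5).
p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- pnorm(z, lower.tail = FALSE)

# Exact one-sided binomial test as a cross-check.
p_exact <- binom.test(x, n, p = p0, alternative = "greater")$p.value

round(c(p_hat = p_hat, X2 = z^2, p = p_value), 4)
```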