Possum analysis homework 2

RESEARCH QUESTION: The original possum data frame consists of nine morphometric measurements on each of 104 mountain brushtail possums, trapped at seven sites from Southern Victoria to central Queensland.

Source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum?search=possum

library(tidyr)  # provides drop_na()

# Read the data and keep only the six columns used in the analysis
mydata <- read.table("./possum.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[ , -c(1,2,7,10,11,12,13,14)]
colnames(mydata) <- c("Population", "Sex", "Age", "HeadLength", "TotalLength", "TailLength")

# Recode sex and population as labelled factors
mydata$Gender <- factor(mydata$Sex,
                           levels = c("m","f"),
                           labels = c("Male", "Female"))
mydata$PopulationFactor <- factor(mydata$Population,
                           levels = c("Vic","other"),
                           labels = c("Victoria", "Other"))

# Drop rows with missing values, then remove the original Population and Sex columns
mydata1 <- drop_na(mydata)
mydata1 <- mydata1[ , -c(1,2)]
head(mydata1)

##   Age HeadLength TotalLength TailLength Gender PopulationFactor
## 1   8       94.1        89.0       36.0   Male         Victoria
## 2   6       92.5        91.5       36.5 Female         Victoria
## 3   6       94.0        95.5       39.0 Female         Victoria
## 4   6       93.2        92.0       38.0 Female         Victoria
## 5   2       91.5        85.5       36.0 Female         Victoria
## 6   1       93.1        90.5       35.5 Female         Victoria

In the data manipulation step I kept only the columns relevant to the analysis, renamed them and relabelled the factor levels. I also dropped the two rows that contained missing values.

The unit of observation is a possum caught in Victoria, New South Wales or Queensland. After filtering the data I was left with 102 units of observation. I analyzed the 6 variables described below. The data contain both numerical and categorical variables.

No outliers are present. Head length, tail length and total length are naturally correlated with each other and also with the age of the possum.

Description of the variables:

Age: The age of captured possum in years.
HeadLength: Length of the head, in mm.
TotalLength: Total length of the animal, in cm.
TailLength: Length of the tail, in cm.
Gender: Gender of the possum, either Male or Female.
PopulationFactor: Population, either Victoria or Other (New South Wales or Queensland), depending on where the possum was trapped.

The research question is whether the total length of the animal, as well as its tail and head length, is greater in male than in female possums.

summary(mydata1)

##       Age          HeadLength      TotalLength      TailLength       Gender   PopulationFactor
##  Min.   :1.000   Min.   : 82.50   Min.   :75.00   Min.   :32.00   Male  :59   Victoria:44     
##  1st Qu.:2.250   1st Qu.: 90.70   1st Qu.:84.12   1st Qu.:36.00   Female:43   Other   :58     
##  Median :3.000   Median : 92.85   Median :88.00   Median :37.00                               
##  Mean   :3.833   Mean   : 92.69   Mean   :87.23   Mean   :37.04                               
##  3rd Qu.:5.000   3rd Qu.: 94.78   3rd Qu.:90.00   3rd Qu.:38.00                               
##  Max.   :9.000   Max.   :103.10   Max.   :96.50   Max.   :43.00

The median is a measure of central tendency that represents the midpoint of a dataset. It is the value that separates the higher half from the lower half of a dataset. To find the median of a dataset, you need to arrange the values in numerical order and then find the middle value. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the average of the two middle values.

Similarly the mean is a measure of central tendency that represents the average of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the number of values.
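As a small illustration of the two definitions above, the sketch below computes the median and the mean by hand and compares them with R's built-in `median()` and `mean()`. The vectors and the helper names `median_by_hand` and `mean_by_hand` are made up for this example and are not part of the possum analysis.

```r
# Hypothetical example values, not taken from the possum data
odd_values  <- c(7, 1, 3, 9, 5)        # odd number of values
even_values <- c(7, 1, 3, 9, 5, 11)    # even number of values

# Median by hand: sort the values, then take the middle one
# (or the average of the two middle ones when the count is even)
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

# Mean by hand: sum of the values divided by the number of values
mean_by_hand <- function(x) sum(x) / length(x)

median_by_hand(odd_values)    # 5
median_by_hand(even_values)   # 6
mean_by_hand(odd_values)      # 5
mean_by_hand(even_values)     # 6
```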

After computing the descriptive statistics we observe the information shown above. Age ranged from 1 to 9 years, with a median of 3 years and a mean of 3.833 years. This means that the sum of the ages of all trapped possums divided by the number of possums is 3.833, and that half of the possums were 3 years old or younger and half were 3 years old or older. Head length ranged from 82.50 mm to 103.10 mm, and tail length from 32 to 43 cm. Their means were 92.69 mm and 37.04 cm, and their medians were 92.85 mm and 37.00 cm.

1: Are male possums on average larger than the female possums? (Independent samples t-test)

I set the following hypotheses to be tested:

H0: Male and female possums have on average the same total length.

H1: Male and female possums have on average different total length.

library(ggplot2)

Male <- ggplot(mydata1[mydata1$Gender == "Male", ], aes(x= TotalLength)) +
  theme_minimal() +
  geom_histogram(binwidth = 3, fill = "blue", color = "black") +
  ggtitle("Lengths of males")

Female <- ggplot(mydata1[mydata1$Gender == "Female", ], aes(x= TotalLength)) +
  theme_minimal() +
  geom_histogram(binwidth = 3, fill = "pink",color = "black") +
  ggtitle("Lengths of females")

library(ggpubr)
ggarrange(Male, Female, 
          ncol = 2)

The length of males does not look normally distributed, as it has two peaks. The distribution of the length of female possums looks normal. To be sure, I am going to check the normality of both distributions with the Shapiro-Wilk test.

H0: variable is normally distributed.

H1: variable is not normally distributed.

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata1 %>% 
  rstatix::group_by(Gender) %>% 
  shapiro_test(TotalLength)

## # A tibble: 2 × 4
##   Gender variable    statistic     p
##   <fct>  <chr>           <dbl> <dbl>
## 1 Male   TotalLength     0.979 0.384
## 2 Female TotalLength     0.972 0.371

As both p-values are greater than 0.05 (5%), we cannot reject the null hypothesis of normality for either group. The next assumption I have to test is whether the TotalLength variable has the same variance for male and female possums.
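The same per-group normality check can also be sketched with base R only, without rstatix. The data frame `sim` below is simulated purely for illustration (its values are an assumption, not the possum data), so the p-values it produces say nothing about the real sample.

```r
# Simulated (hypothetical) lengths; the real analysis uses mydata1$TotalLength
set.seed(1)
sim <- data.frame(
  Gender = factor(rep(c("Male", "Female"), times = c(59, 43)),
                  levels = c("Male", "Female")),
  TotalLength = c(rnorm(59, mean = 87, sd = 4),
                  rnorm(43, mean = 88, sd = 4))
)

# Split the lengths by gender and run one Shapiro-Wilk test per group
p_values <- sapply(split(sim$TotalLength, sim$Gender),
                   function(v) shapiro.test(v)$p.value)

# A p-value above 0.05 means normality is not rejected for that group
p_values
p_values > 0.05
```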

H0: variable has the same variance.

H1: Variable has different variance.

aggregate(mydata1$TotalLength, list(mydata1$Gender),sd)

##   Group.1        x
## 1    Male 4.173828
## 2  Female 4.182241

var.test(TotalLength ~ Gender , data = mydata1)

## 
##  F test to compare two variances
## 
## data:  TotalLength by Gender
## F = 0.99598, num df = 58, denom df = 42, p-value = 0.9766
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5568723 1.7354568
## sample estimates:
## ratio of variances 
##          0.9959809

We cannot reject the null hypothesis that the variance of total length is the same for male and female possums (p = 0.9766).

As the variable is numeric, its distribution is normal in both populations, the data come from two independent populations and the variable TotalLength has the same variance in both populations, I will perform the independent samples t-test.
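To make the F test more concrete, the sketch below recomputes its statistic from the two standard deviations printed by `aggregate()` above. It assumes only those rounded values and the group sizes (59 males, 43 females), so small rounding differences from the `var.test()` output are possible.

```r
# Standard deviations and group sizes as reported in the output above
sd_male   <- 4.173828
sd_female <- 4.182241
n_male    <- 59
n_female  <- 43

# F statistic: ratio of the two sample variances
f_stat <- sd_male^2 / sd_female^2

# Two-sided p-value from the F distribution with (n1 - 1, n2 - 1) degrees of freedom
p_lower <- pf(f_stat, df1 = n_male - 1, df2 = n_female - 1)
p_value <- 2 * min(p_lower, 1 - p_lower)

round(f_stat, 5)    # 0.99598, matching the var.test() output
round(p_value, 4)
```

A ratio this close to 1 is what we would expect when the two standard deviations differ only in the third decimal place.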

t.test(mydata1$TotalLength ~ mydata1$Gender, 
       paired = FALSE,
       var.equal = TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  mydata1$TotalLength by mydata1$Gender
## t = -1.4025, df = 100, p-value = 0.1639
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -2.8365688  0.4870221
## sample estimates:
##   mean in group Male mean in group Female 
##             86.73220             87.90698

Since the p-value (0.1639) is greater than 0.05, and the 95 percent confidence interval for the difference in means (from -2.84 to 0.49) contains 0, we cannot reject the null hypothesis. We did not find evidence that the average total length differs between male and female possums.
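The t statistic and the confidence interval can be reproduced from the group summaries alone. This is a minimal sketch of the pooled-variance formula, assuming the rounded means, standard deviations and group sizes printed above; it is meant as a check of the arithmetic, not as a replacement for `t.test()`.

```r
# Group summaries as reported in the outputs above
mean_male   <- 86.73220; sd_male   <- 4.173828; n_male   <- 59
mean_female <- 87.90698; sd_female <- 4.182241; n_female <- 43

# Pooled variance (valid under the equal-variance assumption)
df <- n_male + n_female - 2
pooled_var <- ((n_male - 1) * sd_male^2 + (n_female - 1) * sd_female^2) / df

# Standard error of the difference in means
se_diff <- sqrt(pooled_var * (1 / n_male + 1 / n_female))

# t statistic, two-sided p-value and 95% confidence interval
mean_diff <- mean_male - mean_female
t_stat    <- mean_diff / se_diff
p_value   <- 2 * pt(-abs(t_stat), df)
ci        <- mean_diff + c(-1, 1) * qt(0.975, df) * se_diff

round(t_stat, 4)    # -1.4025
round(p_value, 4)
round(ci, 2)
```

The interval is built around the difference in means, so the relevant check is whether it contains 0, not whether it contains the t statistic.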

2: Is there an association between the gender and the location of the trapped animal?

I wanted to test whether the gender of the caught animals varied depending on the location where they were trapped, so I performed the chi-square test. The hypotheses were as follows:

H0: There is no association between the gender and the area they were trapped

H1: There is an association between the gender and the area they were trapped

obmocje_spol <- chisq.test(mydata1$Gender, mydata1$PopulationFactor, correct = TRUE)

obmocje_spol

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata1$Gender and mydata1$PopulationFactor
## X-squared = 4.0177, df = 1, p-value = 0.04502

As p-value is smaller than 5%, we conclude that there is an association between gender and the area the possums were trapped.
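The test statistic can be recomputed by hand from the 2 x 2 table of counts (20 and 39 males, 24 and 19 females, in Victoria and Other respectively). The sketch below assumes these counts and applies the Yates continuity correction, which subtracts 0.5 from each absolute difference between observed and expected counts.

```r
# Observed counts (rows: gender, columns: population)
observed <- matrix(c(20, 24, 39, 19), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Population = c("Victoria", "Other")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' continuity correction
chi_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# p-value from the chi-square distribution with (2 - 1) * (2 - 1) = 1 df
p_value <- pchisq(chi_yates, df = 1, lower.tail = FALSE)

round(chi_yates, 4)   # 4.0177
round(p_value, 5)
```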

Here are some additional statistics that we also computed in the lectures.

addmargins(obmocje_spol$observed)

##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other Sum
##         Male         20    39  59
##         Female       24    19  43
##         Sum          44    58 102

As seen from above results, only 20 males were caught in Victoria and 39 were caught in Queensland or New South Wales. On the other hand, more females were caught in Victoria with 24, compared to 19 trapped in other two areas.

round(obmocje_spol$expected, 2)

##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other
##         Male      25.45 33.55
##         Female    18.55 24.45

Here we see the expected number of animals trapped by gender and area where they were trapped.

round(obmocje_spol$res,2)

##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other
##         Male      -1.08  0.94
##         Female     1.27 -1.10

We see that none of the residuals is statistically significant, neither at α = 0.05 nor at α = 0.01, as all of them are smaller than 1.96 in absolute value.
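The residuals printed above are Pearson residuals, that is (observed - expected) / sqrt(expected). The sketch below recomputes them from the observed counts and compares them with the usual normal critical values. Treating the residuals as approximately standard normal is an assumption of this check.

```r
# Observed counts (rows: gender, columns: population)
observed <- matrix(c(20, 24, 39, 19), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Population = c("Victoria", "Other")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# Two-sided critical values of the standard normal distribution
crit_05 <- qnorm(1 - 0.05 / 2)   # about 1.96
crit_01 <- qnorm(1 - 0.01 / 2)   # about 2.58

# TRUE would flag a cell that deviates significantly from independence
abs(pearson_res) > crit_05
abs(pearson_res) > crit_01
```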

addmargins(round(prop.table(obmocje_spol$observed),3))

##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other   Sum
##         Male      0.196 0.382 0.578
##         Female    0.235 0.186 0.421
##         Sum       0.431 0.568 0.999

We can observe the proportions of trapped animals by gender and by the area where they were trapped.

addmargins(round(prop.table(obmocje_spol$observed,1),3),2)

##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other   Sum
##         Male      0.339 0.661 1.000
##         Female    0.558 0.442 1.000

33.9% of male and 55.8% of female possums were trapped in Victoria, while 66.1% of male and 44.2% of female possums were caught in Queensland or New South Wales.

addmargins(round(prop.table(obmocje_spol$observed,2),3),1)

##               mydata1$PopulationFactor
## mydata1$Gender Victoria Other
##         Male      0.455 0.672
##         Female    0.545 0.328
##         Sum       1.000 1.000

The table is similar to the previous one: 45.5% of the animals trapped in Victoria were male and the rest were female. In New South Wales and Queensland, 67.2% of all caught animals were male and the rest were female.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cramers_v(mydata1$Gender, mydata1$PopulationFactor)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.20              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.20)

## [1] "medium"
## (Rules: funder2019)

The effect size is 0.20, which is interpreted as a medium effect. In other words, there is a medium-strength association between the area of entrapment and gender.
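For reference, Cramér's V can be sketched by hand from the uncorrected chi-square statistic. The adjusted version below assumes that the "(adj.)" in the output refers to the commonly used small-sample bias correction. That is an assumption about what the package does, not something confirmed by the output itself.

```r
# Observed counts (rows: gender, columns: population)
observed <- matrix(c(20, 24, 39, 19), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Population = c("Victoria", "Other")))
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Uncorrected chi-square statistic (no Yates correction)
expected <- outer(rowSums(observed), colSums(observed)) / n
chi2 <- sum((observed - expected)^2 / expected)

# Plain Cramér's V
phi2 <- chi2 / n
v_plain <- sqrt(phi2 / min(r - 1, k - 1))

# Bias-corrected V (assumed to correspond to the "adj." value above)
phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
k_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(v_plain, 3)
round(v_adj, 3)
```

Under these assumptions the unrounded adjusted value is slightly below 0.20, so the verbal label assigned to it can depend on whether the rounded or the unrounded value is interpreted.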

3: Can we conclude that there are more male than female possums trapped?

First we have to check the assumption that we can make the test of population proportion:

p = 0.5
n = 102

assumption <- p*n > 5 & (1-p)*n> 5 
assumption

## [1] TRUE

As we can see the assumption is met and we can perform the wanted test.

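H0: the proportion of male possums among the trapped animals is 0.5 (there are not more males than females).

H1: the proportion of male possums among the trapped animals is greater than 0.5 (there are more males than females).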

prop.test(x=59,
          n=102,
          p=0.5,
          correct = FALSE,
          alternative = "greater")

## 
##  1-sample proportions test without continuity correction
## 
## data:  59 out of 102, null probability 0.5
## X-squared = 2.5098, df = 1, p-value = 0.05657
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.4970004 1.0000000
## sample estimates:
##         p 
## 0.5784314

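Since the p-value (0.05657) is greater than 0.05, we cannot reject the null hypothesis that the proportion of males is 0.5. At the 5% significance level we therefore cannot conclude that more male than female possums were trapped.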
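The one-sample proportion test without continuity correction is equivalent to a one-sided z-test, whose squared statistic equals the reported X-squared. The sketch below assumes the counts used above (59 males out of 102 possums) and recomputes the statistic and the one-sided p-value by hand.

```r
# Counts used in the test above
x  <- 59     # number of male possums
n  <- 102    # total number of possums
p0 <- 0.5    # proportion under the null hypothesis

# Sample proportion and its standard error under H0
p_hat <- x / n
se0   <- sqrt(p0 * (1 - p0) / n)

# z statistic; its square is the X-squared reported by prop.test()
z <- (p_hat - p0) / se0
x_squared <- z^2

# One-sided p-value for the alternative "true p is greater than 0.5"
p_value <- pnorm(z, lower.tail = FALSE)

round(p_hat, 4)       # 0.5784
round(x_squared, 4)   # 2.5098
round(p_value, 5)
```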

Matevž Kopač

2023-01-17
