RESEARCH QUESTION: The original possum data frame consists of eight morphometric measurements on each of 104 mountain brushtail possums, trapped at seven sites from southern Victoria to central Queensland.

Are male possums larger on average than female possums? Are the tails and heads of male possums larger on average than those of female possums?

Source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum?search=possum

mydata <- read.table("./possum.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[ , -c(1,2,7,10,11,12,13,14)]
colnames(mydata) <- c("Population", "Sex", "Age", "HeadLength", "TotalLength", "TailLength")
mydata$GenderFactor <- factor(mydata$Sex,
                           levels = c("m","f"),
                           labels = c("1", "0"))
mydata$Gender <- factor(mydata$Sex,
                           levels = c("m","f"),
                           labels = c("Male", "Female"))
mydata$PopulationFactor <- factor(mydata$Population,
                           levels = c("Vic","other"),
                           labels = c("Victoria", "Other"))
library(tidyr)  # provides drop_na()
mydata1 <- drop_na(mydata)
mydata1 <- mydata1[ , -c(1,2)]
head(mydata1)
##   Age HeadLength TotalLength TailLength GenderFactor Gender PopulationFactor
## 1   8       94.1        89.0       36.0            1   Male         Victoria
## 2   6       92.5        91.5       36.5            0 Female         Victoria
## 3   6       94.0        95.5       39.0            0 Female         Victoria
## 4   6       93.2        92.0       38.0            0 Female         Victoria
## 5   2       91.5        85.5       36.0            0 Female         Victoria
## 6   1       93.1        90.5       35.5            0 Female         Victoria

During data preparation I kept only the columns relevant to the research question, renamed them, and recoded the category labels. I also dropped the two rows that contained missing values.
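The row-dropping step can be illustrated with base R alone. The following is a minimal sketch, not the code used above: the first three rows copy values from the head() output, the fourth row with a missing age is made up for illustration, and `toy` and `toy_complete` are hypothetical names.

```r
# Toy data frame: rows 1-3 copy values shown by head(mydata1);
# row 4 is a made-up record with a missing Age, added for illustration only.
toy <- data.frame(
  Age         = c(8, 6, 6, NA),
  HeadLength  = c(94.1, 92.5, 94.0, 93.0),
  TotalLength = c(89.0, 91.5, 95.5, 90.0),
  TailLength  = c(36.0, 36.5, 39.0, 37.0)
)

# complete.cases() flags rows without any NA; subsetting on it keeps
# the same rows that tidyr::drop_na(toy) would keep.
toy_complete <- toy[complete.cases(toy), ]

nrow(toy)           # 4 rows before
nrow(toy_complete)  # 3 rows after the incomplete row is dropped
```

Applied to the full table, the same idea reduces the 104 trapped possums to the 102 complete observations used in the rest of the analysis.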

The unit of observation is a single possum caught in Victoria, New South Wales, or Queensland. After removing the incomplete rows, 102 units of observation remained. I analyzed the 6 variables described below; the data include both numerical and categorical variables.

No outliers are present. Head length, tail length, and total length are naturally correlated with one another and also with the age of the possum.

Description of the variables:

As stated in the research question, the goal is to find out whether total length, tail length, and head length are greater in male than in female possums.

summary(mydata1)
##       Age          HeadLength      TotalLength      TailLength    GenderFactor    Gender   PopulationFactor
##  Min.   :1.000   Min.   : 82.50   Min.   :75.00   Min.   :32.00   1:59         Male  :59   Victoria:44     
##  1st Qu.:2.250   1st Qu.: 90.70   1st Qu.:84.12   1st Qu.:36.00   0:43         Female:43   Other   :58     
##  Median :3.000   Median : 92.85   Median :88.00   Median :37.00                                            
##  Mean   :3.833   Mean   : 92.69   Mean   :87.23   Mean   :37.04                                            
##  3rd Qu.:5.000   3rd Qu.: 94.78   3rd Qu.:90.00   3rd Qu.:38.00                                            
##  Max.   :9.000   Max.   :103.10   Max.   :96.50   Max.   :43.00

The median is a measure of central tendency that represents the midpoint of a dataset. It is the value that separates the higher half from the lower half of a dataset. To find the median of a dataset, you need to arrange the values in numerical order and then find the middle value. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the average of the two middle values.

Similarly the mean is a measure of central tendency that represents the average of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the number of values.
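As a small worked example of both definitions, the sketch below uses only the ages of the six possums printed by head(mydata1) above. The vector name `ages` is chosen here for illustration. Because the number of values is even, the median is the average of the two middle values.

```r
# Ages of the six possums shown in the head(mydata1) output above
ages <- c(8, 6, 6, 6, 2, 1)

sort(ages)    # 1 2 6 6 6 8: the two middle values are both 6
median(ages)  # 6, the average of the two middle values
mean(ages)    # 29 / 6, roughly 4.83
```

The mean (about 4.83) lies below the median (6) in this small subset because the two young animals pull the average down.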

After computing the descriptive statistics, we observe the values shown above. Age ranged from 1 to 9 years, with a median of 3 years and a mean of 3.833 years. That is, the sum of the ages of all trapped possums divided by the number of possums is 3.833, and half of the possums were 3 years old or younger while the other half were 3 years old or older. Head length ranged from 82.50 mm to 103.10 mm, and tail length from 32 cm to 43 cm. The means were 92.69 mm and 37.04 cm; the medians were 92.85 mm and 37.00 cm.

library(RColorBrewer)
a <- 59/102
b <- 43/102
proportions <- c(a,b)
myPalette <- brewer.pal(5, "Set2")  
pie(proportions, labels = c("Male","Female"), border="Black", col=myPalette)

In the sample, 57.8% of the possums were male and 42.2% were female.
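The shares can be recomputed from the group sizes reported by summary(mydata1). This is a minimal sketch with hypothetical variable names (`counts`, `percentages`); with the data loaded, prop.table(table(mydata1$Gender)) should give the same proportions without typing the counts by hand.

```r
# Group sizes as reported by summary(mydata1)
counts <- c(Male = 59, Female = 43)

# Share of each group in the sample, in percent
percentages <- 100 * counts / sum(counts)
round(percentages, 1)  # Male 57.8, Female 42.2
```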

library(ggplot2)
ggplot(mydata1, aes(x = TotalLength, fill = Gender)) +
  geom_histogram(position = position_dodge(width = 0.5), binwidth = 1, color = "black") +
  xlab("total length") +
  ylab("Frequency") +
  ggtitle("Distribution of possums total length based on gender")

aggregate(mydata1$TotalLength, list(mydata1$Gender),mean)
##   Group.1        x
## 1    Male 86.73220
## 2  Female 87.90698
aggregate(mydata1$TotalLength, list(mydata1$Gender),sd)
##   Group.1        x
## 1    Male 4.173828
## 2  Female 4.182241

The graph above shows the distribution of total length by gender. On average, the males were shorter than the females: the mean total length was 86.7 cm for males and 87.9 cm for females. The standard deviations are very similar, differing by only about 0.008 cm.
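The differences between the groups can be checked directly from the aggregate() output. The sketch below copies the reported means and standard deviations; the variable names are chosen here for illustration.

```r
# Group summaries of total length copied from the aggregate() output above (cm)
mean_len <- c(Male = 86.73220, Female = 87.90698)
sd_len   <- c(Male = 4.173828, Female = 4.182241)

# Female minus male differences
diff_mean <- mean_len[["Female"]] - mean_len[["Male"]]
diff_sd   <- sd_len[["Female"]] - sd_len[["Male"]]

round(diff_mean, 2)  # 1.17 cm
round(diff_sd, 3)    # 0.008 cm
```

The difference in means is small relative to the spread within each group (about 4.2 cm), so these figures describe the sample and do not by themselves show a difference in the population.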

This is visible in the graph: the peak of the female distribution lies further to the right, towards higher values. Both distributions look approximately normal, although the distribution of male total lengths has two peaks.

library(ggplot2)
ggplot(mydata1, aes(x = TailLength, fill = Gender)) +
  geom_histogram(position = position_dodge(width = 0.5), binwidth = 1, color = "black") +
  xlab("Tail length") +
  ylab("Frequency") +
  ggtitle("Distribution of possums tail length based on gender")

aggregate(mydata1$TailLength, list(mydata1$Gender),mean)
##   Group.1        x
## 1    Male 37.00000
## 2  Female 37.10465
aggregate(mydata1$TailLength, list(mydata1$Gender),sd)
##   Group.1        x
## 1    Male 2.067816
## 2  Female 1.830815
ggplot(mydata1, aes(x = HeadLength, fill = Gender)) +
  geom_histogram(position = position_dodge(width = 0.5), binwidth = 1, color = "black") +
  xlab("Head length") +
  ylab("Frequency") +
  ggtitle("Distribution of possums head length based on gender")

aggregate(mydata1$HeadLength, list(mydata1$Gender),mean)
##   Group.1        x
## 1    Male 93.08136
## 2  Female 92.14884
aggregate(mydata1$HeadLength, list(mydata1$Gender),sd)
##   Group.1        x
## 1    Male 4.061190
## 2  Female 2.574913

In this sample, the female possums are slightly longer on average and have slightly longer tails than the males, whereas the males have longer heads on average. The standard deviation of male head length is about 1.6 times that of female head length (4.06 mm versus 2.57 mm), which is visible in the graph, where the male distribution is much wider.
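How much wider the male head-length distribution is can be checked from the standard deviations reported by aggregate(). The following sketch copies those two values; `sd_head` and `sd_ratio` are hypothetical names.

```r
# Standard deviations of head length from the aggregate() output above (mm)
sd_head <- c(Male = 4.061190, Female = 2.574913)

# Ratio of the male to the female standard deviation
sd_ratio <- sd_head[["Male"]] / sd_head[["Female"]]
round(sd_ratio, 2)  # 1.58
```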

library(ggplot2)
ggplot(mydata1, aes(x = Age, fill = Gender)) +
  geom_histogram(position = position_dodge(width = 0.5), binwidth = 1, color = "black") +
  xlab("Age") +
  ylab("Frequency") +
  ggtitle("Distribution of possums age by gender")

aggregate(mydata1$Age, list(mydata1$Gender),mean)
##   Group.1        x
## 1    Male 3.728814
## 2  Female 3.976744

To conclude, I looked at the age distribution of the possums, since age could affect their size. The trapped females were on average about 0.25 years (3 months) older than the males. The graph shows that the two distributions are very similar, with relatively more males in the younger age groups.