RESEARCH QUESTION: The possum data frame consists of eight morphometric measurements on each of 104 mountain brushtail possums, trapped at seven sites from southern Victoria to central Queensland.
Are male possums larger on average than female possums? Are the tails and heads of male possums longer on average than those of female possums?
Source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum?search=possum
mydata <- read.table("./possum.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[ , -c(1,2,7,10,11,12,13,14)]
colnames(mydata) <- c("Population", "Sex", "Age", "HeadLength", "TotalLength", "TailLength")
mydata$GenderFactor <- factor(mydata$Sex,
                              levels = c("m", "f"),
                              labels = c("1", "0"))
mydata$Gender <- factor(mydata$Sex,
                        levels = c("m", "f"),
                        labels = c("Male", "Female"))
mydata$PopulationFactor <- factor(mydata$Population,
                                  levels = c("Vic", "other"),
                                  labels = c("Victoria", "Other"))
library(tidyr)  # provides drop_na()
mydata1 <- drop_na(mydata)
mydata1 <- mydata1[ , -c(1,2)]
head(mydata1)
##   Age HeadLength TotalLength TailLength GenderFactor Gender PopulationFactor
## 1   8       94.1        89.0       36.0            1   Male         Victoria
## 2   6       92.5        91.5       36.5            0 Female         Victoria
## 3   6       94.0        95.5       39.0            0 Female         Victoria
## 4   6       93.2        92.0       38.0            0 Female         Victoria
## 5   2       91.5        85.5       36.0            0 Female         Victoria
## 6   1       93.1        90.5       35.5            0 Female         Victoria
During data preparation I kept only the columns relevant to the research question, renamed them and relabelled the factor levels. I also dropped the two rows that contained missing values.
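The cleaning steps described here can also be sketched in base R, selecting columns by name rather than by position and counting how many rows are lost to missing values. This is only an illustration: the toy data frame `raw` and its values are invented, and the raw column names (`Pop`, `sex`, `age`, `hdlngth`) are an assumption about the CSV header, not something shown in this report.

```r
# Toy stand-in for the raw file; values are invented for illustration only.
# The column names are assumed, not taken from the report.
raw <- data.frame(
  Pop     = c("Vic", "Vic", "other", "other"),
  sex     = c("m", "f", "f", "m"),
  age     = c(8, NA, 3, 2),
  hdlngth = c(94.1, 92.5, 91.0, 93.3)
)

# Select columns by name instead of by position, then rename them.
clean <- raw[, c("Pop", "sex", "age", "hdlngth")]
colnames(clean) <- c("Population", "Sex", "Age", "HeadLength")

# Keep only complete rows and count how many rows were dropped.
complete  <- clean[complete.cases(clean), ]
n_dropped <- nrow(clean) - nrow(complete)
print(n_dropped)
```

Selecting by name makes the script robust to a changed column order, and printing the number of dropped rows documents the step from 104 to 102 observations.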
The unit of observation is a possum caught in Victoria, New South Wales or Queensland. After cleaning the data, 102 units of observation remained. I analyzed the 6 variables described below. The data include both numerical and categorical variables.
No values look implausible for the species, so no observations were removed as outliers. Head length, tail length and total length are expected to be correlated with one another and with the age of the possum.
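One common way to screen for unusual values is the 1.5 × IQR rule used by boxplots. The sketch below applies it to the rounded quartiles and extremes printed by `summary(mydata1)` later in this report; it is a rough check under that assumption, not part of the original analysis, and a flagged value is not necessarily an error.

```r
# Quartiles and extremes copied from the summary() output of this report.
q1   <- c(Age = 2.25, HeadLength = 90.70, TotalLength = 84.12, TailLength = 36.00)
q3   <- c(Age = 5.00, HeadLength = 94.78, TotalLength = 90.00, TailLength = 38.00)
vmin <- c(Age = 1.00, HeadLength = 82.50, TotalLength = 75.00, TailLength = 32.00)
vmax <- c(Age = 9.00, HeadLength = 103.10, TotalLength = 96.50, TailLength = 43.00)

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged.
iqr   <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# One row per variable: the fences and whether min or max falls outside them.
fences <- data.frame(lower, upper, below = vmin < lower, above = vmax > upper)
print(fences)
```

With these inputs the rule flags the extreme head and tail lengths, so a boxplot per variable would be a sensible follow-up before deciding whether any observation deserves a closer look.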
Description of the variables:
Age: The age of captured possum in years.
HeadLength: Length of the head, in mm.
TotalLength: Total length of the animal, in cm.
TailLength: Length of the tail, in cm.
GenderFactor: Gender, 0 is for female and 1 is for male.
PopulationFactor: Population, either Victoria or Other (New South Wales or Queensland), depending on where the possum was trapped.
As stated in the research question, the goal is to find out whether total length, tail length and head length are greater in male than in female possums.
summary(mydata1)
##       Age          HeadLength      TotalLength      TailLength    GenderFactor    Gender    PopulationFactor
##  Min.   :1.000   Min.   : 82.50   Min.   :75.00   Min.   :32.00   1:59         Male  :59   Victoria:44
##  1st Qu.:2.250   1st Qu.: 90.70   1st Qu.:84.12   1st Qu.:36.00   0:43         Female:43   Other   :58
##  Median :3.000   Median : 92.85   Median :88.00   Median :37.00
##  Mean   :3.833   Mean   : 92.69   Mean   :87.23   Mean   :37.04
##  3rd Qu.:5.000   3rd Qu.: 94.78   3rd Qu.:90.00   3rd Qu.:38.00
##  Max.   :9.000   Max.   :103.10   Max.   :96.50   Max.   :43.00
The median is a measure of central tendency that represents the midpoint of a dataset. It is the value that separates the higher half from the lower half of a dataset. To find the median of a dataset, you need to arrange the values in numerical order and then find the middle value. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the average of the two middle values.
Similarly the mean is a measure of central tendency that represents the average of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the number of values.
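A small worked example may make the two definitions concrete. The vectors below are made up for illustration and are unrelated to the possum data.

```r
# Made-up values, deliberately unsorted; median() sorts internally.
odd_values  <- c(5, 1, 9, 3, 7)   # sorted: 1 3 5 7 9
even_values <- c(5, 1, 9, 3)      # sorted: 1 3 5 9

median_odd  <- median(odd_values)   # middle value: 5
median_even <- median(even_values)  # average of the two middle values: (3 + 5) / 2 = 4
mean_even   <- mean(even_values)    # sum divided by count: 18 / 4 = 4.5

print(c(median_odd = median_odd, median_even = median_even, mean_even = mean_even))
```

In the even case the mean (4.5) exceeds the median (4) because the single large value 9 pulls the average upward, while the median only depends on the middle of the ordering.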
After computing the descriptive statistics, we observe the following. Age ranged from 1 to 9 years, with a median of 3 years and a mean of 3.833 years. This means that the summed ages of all trapped possums divided by the number of possums is 3.833, and that half of the possums were 3 years old or younger while the other half were 3 years old or older. Head length ranged from 82.50 mm to 103.10 mm and tail length from 32 cm to 43 cm. The means were 92.69 mm and 37.04 cm, and the medians 92.85 mm and 37.00 cm, respectively.
library(RColorBrewer)
a <- 59/102
b <- 43/102
proportions <- c(a,b)
myPalette <- brewer.pal(5, "Set2")
pie(proportions, labels = c("Male","Female"), border="Black", col=myPalette)
In the sample, 57.8% of all possums were male and 42.2% were female.
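The shares can be checked directly from the group counts reported by `summary(mydata1)` (59 males, 43 females). The sketch below is a minimal check using base R only; the names `counts` and `shares` are introduced here for illustration.

```r
# Group sizes taken from the summary() output above.
counts <- c(Male = 59, Female = 43)

# prop.table() divides each count by the total; convert to percent and round.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

On the full data frame, `prop.table(table(mydata1$Gender))` would give the same proportions without typing the counts by hand.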
library(ggplot2)
ggplot(mydata1, aes (x = TotalLength, fill=Gender)) +
(geom_histogram(position = position_dodge(width = 0.5),binwidth = 1, color = "black")) +
xlab("total length") +
ylab("Frequency") +
ggtitle("Distribution of possums total length based on gender")
aggregate(mydata1$TotalLength, list(mydata1$Gender),mean)
## Group.1 x
## 1 Male 86.73220
## 2 Female 87.90698
aggregate(mydata1$TotalLength, list(mydata1$Gender),sd)
## Group.1 x
## 1 Male 4.173828
## 2 Female 4.182241
The graph above shows the distribution of total length by gender. Males were shorter on average than females: the mean total length was 86.7 cm for males and 87.9 cm for females. The standard deviations are very similar, differing by only about 0.008 cm.
This is visible in the graph, where the peak of the female distribution lies further to the right, towards higher values. Both distributions look roughly bell-shaped, although the distribution of male total lengths has two peaks.
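To get a rough sense of whether a gap of about 1.2 cm is large relative to sampling variability, one option is Welch's two-sample t statistic, which can be computed from the group means, standard deviations and sizes printed above. This goes beyond the descriptive comparison in this report and is only a sketch; it assumes independent observations and uses the rounded summary values rather than the raw data.

```r
# Group summaries copied from the aggregate() and summary() output above.
n_male   <- 59; mean_male   <- 86.73220; sd_male   <- 4.173828
n_female <- 43; mean_female <- 87.90698; sd_female <- 4.182241

# Variance of each group mean (squared standard error).
var_mean_male   <- sd_male^2 / n_male
var_mean_female <- sd_female^2 / n_female

# Welch's t statistic: difference in means over its standard error.
t_stat <- (mean_male - mean_female) / sqrt(var_mean_male + var_mean_female)

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (var_mean_male + var_mean_female)^2 /
  (var_mean_male^2 / (n_male - 1) + var_mean_female^2 / (n_female - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), welch_df)
print(c(t = t_stat, df = welch_df, p = p_value))
```

With the full data frame, `t.test(TotalLength ~ Gender, data = mydata1)` should reproduce these numbers, since `t.test()` uses the Welch variant by default.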
library(ggplot2)
ggplot(mydata1, aes (x = TailLength, fill=Gender)) +
(geom_histogram(position = position_dodge(width = 0.5),binwidth = 1, color = "black")) +
xlab("Tail length") +
ylab("Frequency") +
ggtitle("Distribution of possums tail length based on gender")
aggregate(mydata1$TailLength, list(mydata1$Gender),mean)
## Group.1 x
## 1 Male 37.00000
## 2 Female 37.10465
aggregate(mydata1$TailLength, list(mydata1$Gender),sd)
## Group.1 x
## 1 Male 2.067816
## 2 Female 1.830815
ggplot(mydata1, aes (x = HeadLength, fill=Gender)) +
(geom_histogram(position = position_dodge(width = 0.5),binwidth = 1, color = "black")) +
xlab("Head length") +
ylab("Frequency") +
ggtitle("Distribution of possums head length based on gender")
aggregate(mydata1$HeadLength, list(mydata1$Gender),mean)
## Group.1 x
## 1 Male 93.08136
## 2 Female 92.14884
aggregate(mydata1$HeadLength, list(mydata1$Gender),sd)
## Group.1 x
## 1 Male 4.061190
## 2 Female 2.574913
In this sample, female possums are slightly longer on average and have marginally longer tails than males (37.1 cm versus 37.0 cm), whereas males have longer heads (93.1 mm versus 92.1 mm). The standard deviation of male head length (4.06 mm) is about 1.6 times that of females (2.57 mm), which is visible in the graph, where the male distribution is much wider.
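The separate `aggregate()` calls for each measurement could be condensed into a single call with the formula interface. The sketch below demonstrates this on the six rows printed by `head(mydata1)` only, so the resulting means are not those of the full sample; `first_rows` and `group_means` are names introduced here for illustration.

```r
# The six rows shown by head(mydata1), re-entered by hand.
first_rows <- data.frame(
  Gender      = factor(c("Male", "Female", "Female", "Female", "Female", "Female"),
                       levels = c("Male", "Female")),
  HeadLength  = c(94.1, 92.5, 94.0, 93.2, 91.5, 93.1),
  TotalLength = c(89.0, 91.5, 95.5, 92.0, 85.5, 90.5),
  TailLength  = c(36.0, 36.5, 39.0, 38.0, 36.0, 35.5)
)

# cbind() on the left-hand side summarises several columns at once.
group_means <- aggregate(cbind(HeadLength, TotalLength, TailLength) ~ Gender,
                         data = first_rows, FUN = mean)
print(group_means)
```

Replacing `first_rows` with `mydata1` and `mean` with `sd` would give the group standard deviations in the same layout, which makes the three measurements easier to compare side by side.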
library(ggplot2)
ggplot(mydata1, aes (x = Age, fill=Gender)) +
(geom_histogram(position = position_dodge(width = 0.5),binwidth = 1, color = "black")) +
xlab("Age") +
ylab("Frequency") +
ggtitle("Distribution of possums age by gender")
aggregate(mydata1$Age, list(mydata1$Gender),mean)
## Group.1 x
## 1 Male 3.728814
## 2 Female 3.976744
To conclude, I looked at the age distribution by gender, since age could affect the size of the possums. The trapped females were on average 0.25 years (about 3 months) older than the males. The graph shows that the two distributions are very similar, with somewhat more young males.