RESEARCH QUESTION: The possum original data frame consists of eight morphometric measurements on each of 104 mountain brushtail possums, trapped at seven sites from Southern Victoria to central Queensland.
Source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum?search=possum
mydata <- read.table("./possum.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[ , -c(1,2,7,10,11,12,13,14)]
colnames(mydata) <- c("Population", "Sex", "Age", "HeadLength", "TotalLength", "TailLength")
mydata$Gender <- factor(mydata$Sex,
levels = c("m","f"),
labels = c("Male", "Female"))
mydata$PopulationFactor <- factor(mydata$Population,
levels = c("Vic","other"),
labels = c("Victoria", "Other"))
mydata1 <- drop_na(mydata)
mydata1 <- mydata1[ , -c(1,2)]
head(mydata1)
## Age HeadLength TotalLength TailLength Gender PopulationFactor
## 1 8 94.1 89.0 36.0 Male Victoria
## 2 6 92.5 91.5 36.5 Female Victoria
## 3 6 94.0 95.5 39.0 Female Victoria
## 4 6 93.2 92.0 38.0 Female Victoria
## 5 2 91.5 85.5 36.0 Female Victoria
## 6 1 93.1 90.5 35.5 Female Victoria
#summary(mydata1) I used summary for some other explanatory info
In the data manipulation I filtered the table for the relevant information regarding possums, renamed the columns and changed the labels. I also dropped the two lines including data that were not available.
The unit of observation is a possum caught in Victoria, new South Wales or Queensland. After filtering the data I was left out with 102 units of observation. I analyzed 6 variables described below. Data is of both, numerical and categorical type.
The outliers are not present and the data about head lengths, tail lengths and total lengths are naturally correlated among themselves and also with age of the possum.
Description of the variables:
Age: The age of captured possum in years.
HeadLength: Length of the head, in mm.
TotalLength: Total length of the animal, in cm.
TailLength: Length of the tail, in cm.
GenderFactor: Gender, 0 is for female and 1 is for male.
PopulationFactor: Population, either Victoria or Other (New South Wales or Queensland), depending on where the possum was trapped.
We want to analyze if possum was captured in victoria or some other area (New South Wales or Queensland) based on their bodily charactersitics observed in the given dataset.
fit0 <- glm(PopulationFactor ~ 1,
family = binomial,
data = mydata1)
summary(fit0)
##
## Call:
## glm(formula = PopulationFactor ~ 1, family = binomial, data = mydata1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.297 -1.297 1.063 1.063 1.063
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2763 0.1999 1.382 0.167
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 139.47 on 101 degrees of freedom
## Residual deviance: 139.47 on 101 degrees of freedom
## AIC: 141.47
##
## Number of Fisher Scoring iterations: 4
First I create a binary variable to test whether there is any difference in the distribution of possums caught in Victoria and other areas, by fitting a null model and calculating the pseudo R-squared value.
exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
## odds 2.5 % 97.5 %
## (Intercept) 1.318182 0.8908441 1.950513
I calculated the odds ratio and the confidence interval for the null model.
head(fitted(fit0))
## 1 2 3 4 5 6
## 0.5686275 0.5686275 0.5686275 0.5686275 0.5686275 0.5686275
Pseudo_R2_fit1 <- 44/102
Pseudo_R2_fit1
## [1] 0.4313725
The probability of possum being trapped in areas other than Victoria was 56.9 percent (probability that y = 1). The probability of correctly classified units is 43.1 percent. The probabilities are all the same as we did not include any other explanatory variable.
fit1 <- glm(PopulationFactor ~ Age,
family = binomial,
data = mydata1)
summary(fit1)
##
## Call:
## glm(formula = PopulationFactor ~ Age, family = binomial, data = mydata1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4135 -1.2908 0.9763 1.0681 1.1834
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.63201 0.45553 1.387 0.165
## Age -0.09232 0.10563 -0.874 0.382
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 139.47 on 101 degrees of freedom
## Residual deviance: 138.71 on 100 degrees of freedom
## AIC: 142.71
##
## Number of Fisher Scoring iterations: 4
Here I fitted a logistic regression model with “PopulationFactor” as the response variable and “Age” as a explanatory variable. We see that Age is statistically insignificant.
anova(fit0, fit1, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PopulationFactor ~ 1
## Model 2: PopulationFactor ~ Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 101 139.47
## 2 100 138.71 1 0.76832 0.3807
H0: Simple model is better
H1: Complex model is better
The degrees of freedom (m) equals to 1 and deviance is smaller, however we can not reject that simple model is better as p > 0.05. We can not reject the null hypothesis.
The following test could be excluded as there is no significant difference between the simple and complex model, however I included them not to lose any points.
exp(cbind(OR = fit1$coefficients, confint.default(fit1)))
## OR 2.5 % 97.5 %
## (Intercept) 1.8813874 0.7704194 4.594405
## Age 0.9118165 0.7412961 1.121562
As they were not different I decided to look at more complex model.
fit2 <- glm(PopulationFactor ~ Age + HeadLength + TailLength + Gender,
family = binomial,
data = mydata1)
vif(fit2)
## Age HeadLength TailLength Gender
## 1.058323 1.281188 1.347924 1.328513
mydata1$StdResid <- rstandard(fit2)
mydata1$CookDist <- cooks.distance(fit2)
hist(mydata1$StdResid,
main = "Histogram of standardized residuals",
ylab = "Frequency",
xlab = "Standardized residuals")
From the histogram we can see that there is not any outliers based on the standardized residuals (outlier would have standardized residual greater than 3 or smaller than -3).
head(mydata1$CookDist)
## [1] 0.017840829 0.002759401 0.028054562 0.011528748 0.003866819 0.002447345
head(mydata1[order(-mydata1$CookDist), ])
## Age HeadLength TotalLength TailLength Gender PopulationFactor StdResid CookDist
## 46 5 98.6 85.0 34.0 Male Other 2.389378 0.12425974
## 22 3 96.3 91.0 39.5 Male Victoria -2.446373 0.08775985
## 11 9 93.3 89.5 39.0 Female Victoria -1.613600 0.07172448
## 87 6 97.7 84.5 35.0 Male Other 1.983294 0.06634934
## 49 5 95.6 85.0 36.0 Female Other 2.128422 0.06072190
## 21 3 95.9 96.5 39.5 Female Victoria -1.772516 0.04763682
Based on the Cooks distances I will be removing the observations in lines 11, 22 and 46 as these have a higher Cooks distances compared to other observations.
mydata1 <- mydata1[-c(22, 11, 46) , ]
Now i have to refit the model, as I have removed one unit of observation
fit1 <- glm(PopulationFactor ~ Age,
family = binomial,
data = mydata1)
fit2 <- glm(PopulationFactor ~ Age + HeadLength + TailLength + Gender,
family = binomial,
data = mydata1)
Chi-square test to compare two models:
H0: Simple model is better
H1: Complex model is better
anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: PopulationFactor ~ Age
## Model 2: PopulationFactor ~ Age + HeadLength + TailLength + Gender
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 97 134.478
## 2 94 81.707 3 52.771 2.051e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At p< 0,001 I will reject null hypothesis indicating that complex model fits the data better.
I will perform a summary to see which independent variables are significant.
summary(fit2)
##
## Call:
## glm(formula = PopulationFactor ~ Age + HeadLength + TailLength +
## Gender, family = binomial, data = mydata1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9659 -0.6365 0.2020 0.6565 2.3217
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.68536 9.62092 -2.254 0.02420 *
## Age -0.03229 0.14999 -0.215 0.82954
## HeadLength -0.24150 0.09803 -2.464 0.01375 *
## TailLength 1.23326 0.25216 4.891 1e-06 ***
## GenderFemale -2.08893 0.67369 -3.101 0.00193 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.962 on 98 degrees of freedom
## Residual deviance: 81.707 on 94 degrees of freedom
## AIC: 91.707
##
## Number of Fisher Scoring iterations: 5
From the table above, I see that age is not statistically significant.
The odds ratios calculated bellow will give me an exact numbers with which i can explain the remaining coefficients.
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 3.820921e-10 2.470734e-18 0.05908947
## Age 9.682236e-01 7.216120e-01 1.29911501
## HeadLength 7.854524e-01 6.481578e-01 0.95182920
## TailLength 3.432413e+00 2.093921e+00 5.62650525
## GenderFemale 1.238196e-01 3.306293e-02 0.46370081
If head length increases by 1 mm, the odds for possum being trapped in other area then Victoria are equal 0.79 times initial odds under the assumption that the remaining explanatory variables will not change (p < 0.05).
If tail length increases by 1 mm, the odds for possum being trapped in other area then Victoria are equal 3.43 times initial odds under the assumption that the remaining explanatory variables will not change (p < 0.001).
Given the value of all other variables, the odds of male possum living in other area than Victoria are 0.12 times the odds of female possum living in the area other than Victoria (p < 0.01).
mydata1$EstimatedProbab <- fitted(fit2)
head(mydata1)
## Age HeadLength TotalLength TailLength Gender PopulationFactor StdResid CookDist EstimatedProbab
## 1 8 94.1 89.0 36.0 Male Victoria -1.1031943 0.017840829 0.43267981
## 2 6 92.5 91.5 36.5 Female Victoria -0.7564132 0.002759401 0.21547285
## 3 6 94.0 95.5 39.0 Female Victoria -1.5990047 0.028054562 0.80669392
## 4 6 93.2 92.0 38.0 Female Victoria -1.2449171 0.011528748 0.59594618
## 5 2 91.5 85.5 36.0 Female Victoria -0.7710872 0.003866819 0.17679510
## 6 1 93.1 90.5 35.5 Female Victoria -0.5777301 0.002447345 0.07523342
From the table above we see the probabilities of possums being trapped in areas other than Victoria. For example for possum in the first row, that probability is 43.3 %.
mydata1$Classification <- ifelse(test = mydata1$EstimatedProbab < 0.5,
yes = "Victoria",
no = "Other")
ClassificationTable <- table(mydata1$PopulationFactor, mydata1$Classification)
ClassificationTable
##
## Other Victoria
## Victoria 10 32
## Other 50 7
The correctly classified possums are those which lie out of the main diagonal of the table. We see that 10 possums were trapped in Victoria, however the model predicted that they were no. Similarly, 7 were trapped in areas out of Victoria and the model classified them as they are predicted to be caught in victoria.
Pseudo_R2_fit2 <- (ClassificationTable[1,2] + ClassificationTable[2,1])/nrow(mydata1)
Pseudo_R2_fit2
## [1] 0.8282828
We see that 82.8 percent of all units were correctly classified.