RESEARCH QUESTION: The possum original data frame consists of eight morphometric measurements on each of 104 mountain brushtail possums, trapped at seven sites from Southern Victoria to central Queensland.

Source: https://www.kaggle.com/datasets/abrambeyer/openintro-possum?search=possum

mydata <- read.table("./possum.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata[ , -c(1,2,7,10,11,12,13,14)]
colnames(mydata) <- c("Population", "Sex", "Age", "HeadLength", "TotalLength", "TailLength")
mydata$Gender <- factor(mydata$Sex,
                           levels = c("m","f"),
                           labels = c("Male", "Female"))
mydata$PopulationFactor <- factor(mydata$Population,
                           levels = c("Vic","other"),
                           labels = c("Victoria", "Other"))
mydata1 <- drop_na(mydata)
mydata1 <- mydata1[ , -c(1,2)]
head(mydata1)
##   Age HeadLength TotalLength TailLength Gender PopulationFactor
## 1   8       94.1        89.0       36.0   Male         Victoria
## 2   6       92.5        91.5       36.5 Female         Victoria
## 3   6       94.0        95.5       39.0 Female         Victoria
## 4   6       93.2        92.0       38.0 Female         Victoria
## 5   2       91.5        85.5       36.0 Female         Victoria
## 6   1       93.1        90.5       35.5 Female         Victoria
#summary(mydata1) I used summary for some other explanatory info

In the data manipulation I filtered the table for the relevant information regarding possums, renamed the columns and changed the labels. I also dropped the two lines including data that were not available.

The unit of observation is a possum caught in Victoria, new South Wales or Queensland. After filtering the data I was left out with 102 units of observation. I analyzed 6 variables described below. Data is of both, numerical and categorical type.

The outliers are not present and the data about head lengths, tail lengths and total lengths are naturally correlated among themselves and also with age of the possum.

Description of the variables:

We want to analyze if possum was captured in victoria or some other area (New South Wales or Queensland) based on their bodily charactersitics observed in the given dataset.

fit0 <- glm(PopulationFactor ~ 1, 
            family = binomial,
            data = mydata1)

summary(fit0)
## 
## Call:
## glm(formula = PopulationFactor ~ 1, family = binomial, data = mydata1)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.297  -1.297   1.063   1.063   1.063  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.2763     0.1999   1.382    0.167
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139.47  on 101  degrees of freedom
## Residual deviance: 139.47  on 101  degrees of freedom
## AIC: 141.47
## 
## Number of Fisher Scoring iterations: 4

First I create a binary variable to test whether there is any difference in the distribution of possums caught in Victoria and other areas, by fitting a null model and calculating the pseudo R-squared value.

exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
##                 odds     2.5 %   97.5 %
## (Intercept) 1.318182 0.8908441 1.950513

I calculated the odds ratio and the confidence interval for the null model.

head(fitted(fit0))
##         1         2         3         4         5         6 
## 0.5686275 0.5686275 0.5686275 0.5686275 0.5686275 0.5686275
Pseudo_R2_fit1 <-  44/102
Pseudo_R2_fit1
## [1] 0.4313725

The probability of possum being trapped in areas other than Victoria was 56.9 percent (probability that y = 1). The probability of correctly classified units is 43.1 percent. The probabilities are all the same as we did not include any other explanatory variable.

fit1 <- glm(PopulationFactor ~ Age,
            family = binomial, 
            data = mydata1)
summary(fit1)
## 
## Call:
## glm(formula = PopulationFactor ~ Age, family = binomial, data = mydata1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4135  -1.2908   0.9763   1.0681   1.1834  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.63201    0.45553   1.387    0.165
## Age         -0.09232    0.10563  -0.874    0.382
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139.47  on 101  degrees of freedom
## Residual deviance: 138.71  on 100  degrees of freedom
## AIC: 142.71
## 
## Number of Fisher Scoring iterations: 4

Here I fitted a logistic regression model with “PopulationFactor” as the response variable and “Age” as a explanatory variable. We see that Age is statistically insignificant.

anova(fit0, fit1, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: PopulationFactor ~ 1
## Model 2: PopulationFactor ~ Age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       101     139.47                     
## 2       100     138.71  1  0.76832   0.3807

H0: Simple model is better

H1: Complex model is better

The degrees of freedom (m) equals to 1 and deviance is smaller, however we can not reject that simple model is better as p > 0.05. We can not reject the null hypothesis.

The following test could be excluded as there is no significant difference between the simple and complex model, however I included them not to lose any points.

exp(cbind(OR = fit1$coefficients, confint.default(fit1)))
##                    OR     2.5 %   97.5 %
## (Intercept) 1.8813874 0.7704194 4.594405
## Age         0.9118165 0.7412961 1.121562

As they were not different I decided to look at more complex model.

fit2 <- glm(PopulationFactor ~ Age + HeadLength + TailLength + Gender,
            family = binomial,
            data = mydata1)
vif(fit2)
##        Age HeadLength TailLength     Gender 
##   1.058323   1.281188   1.347924   1.328513
mydata1$StdResid <- rstandard(fit2)
mydata1$CookDist <- cooks.distance(fit2)

hist(mydata1$StdResid, 
     main = "Histogram of standardized residuals",
     ylab = "Frequency", 
     xlab = "Standardized residuals")

From the histogram we can see that there is not any outliers based on the standardized residuals (outlier would have standardized residual greater than 3 or smaller than -3).

head(mydata1$CookDist)
## [1] 0.017840829 0.002759401 0.028054562 0.011528748 0.003866819 0.002447345
head(mydata1[order(-mydata1$CookDist), ])
##    Age HeadLength TotalLength TailLength Gender PopulationFactor  StdResid   CookDist
## 46   5       98.6        85.0       34.0   Male            Other  2.389378 0.12425974
## 22   3       96.3        91.0       39.5   Male         Victoria -2.446373 0.08775985
## 11   9       93.3        89.5       39.0 Female         Victoria -1.613600 0.07172448
## 87   6       97.7        84.5       35.0   Male            Other  1.983294 0.06634934
## 49   5       95.6        85.0       36.0 Female            Other  2.128422 0.06072190
## 21   3       95.9        96.5       39.5 Female         Victoria -1.772516 0.04763682

Based on the Cooks distances I will be removing the observations in lines 11, 22 and 46 as these have a higher Cooks distances compared to other observations.

mydata1 <- mydata1[-c(22, 11, 46) , ]

Now i have to refit the model, as I have removed one unit of observation

fit1 <- glm(PopulationFactor ~ Age,
            family = binomial, 
            data = mydata1)


fit2 <- glm(PopulationFactor ~ Age + HeadLength + TailLength + Gender,
            family = binomial,
            data = mydata1)

Chi-square test to compare two models:

H0: Simple model is better

H1: Complex model is better

anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: PopulationFactor ~ Age
## Model 2: PopulationFactor ~ Age + HeadLength + TailLength + Gender
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        97    134.478                          
## 2        94     81.707  3   52.771 2.051e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At p< 0,001 I will reject null hypothesis indicating that complex model fits the data better.

I will perform a summary to see which independent variables are significant.

summary(fit2)
## 
## Call:
## glm(formula = PopulationFactor ~ Age + HeadLength + TailLength + 
##     Gender, family = binomial, data = mydata1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9659  -0.6365   0.2020   0.6565   2.3217  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -21.68536    9.62092  -2.254  0.02420 *  
## Age           -0.03229    0.14999  -0.215  0.82954    
## HeadLength    -0.24150    0.09803  -2.464  0.01375 *  
## TailLength     1.23326    0.25216   4.891    1e-06 ***
## GenderFemale  -2.08893    0.67369  -3.101  0.00193 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.962  on 98  degrees of freedom
## Residual deviance:  81.707  on 94  degrees of freedom
## AIC: 91.707
## 
## Number of Fisher Scoring iterations: 5

From the table above, I see that age is not statistically significant.

The odds ratios calculated bellow will give me an exact numbers with which i can explain the remaining coefficients.

exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
##                        OR        2.5 %     97.5 %
## (Intercept)  3.820921e-10 2.470734e-18 0.05908947
## Age          9.682236e-01 7.216120e-01 1.29911501
## HeadLength   7.854524e-01 6.481578e-01 0.95182920
## TailLength   3.432413e+00 2.093921e+00 5.62650525
## GenderFemale 1.238196e-01 3.306293e-02 0.46370081

If head length increases by 1 mm, the odds for possum being trapped in other area then Victoria are equal 0.79 times initial odds under the assumption that the remaining explanatory variables will not change (p < 0.05).

If tail length increases by 1 mm, the odds for possum being trapped in other area then Victoria are equal 3.43 times initial odds under the assumption that the remaining explanatory variables will not change (p < 0.001).

Given the value of all other variables, the odds of male possum living in other area than Victoria are 0.12 times the odds of female possum living in the area other than Victoria (p < 0.01).

mydata1$EstimatedProbab <- fitted(fit2)
head(mydata1)
##   Age HeadLength TotalLength TailLength Gender PopulationFactor   StdResid    CookDist EstimatedProbab
## 1   8       94.1        89.0       36.0   Male         Victoria -1.1031943 0.017840829      0.43267981
## 2   6       92.5        91.5       36.5 Female         Victoria -0.7564132 0.002759401      0.21547285
## 3   6       94.0        95.5       39.0 Female         Victoria -1.5990047 0.028054562      0.80669392
## 4   6       93.2        92.0       38.0 Female         Victoria -1.2449171 0.011528748      0.59594618
## 5   2       91.5        85.5       36.0 Female         Victoria -0.7710872 0.003866819      0.17679510
## 6   1       93.1        90.5       35.5 Female         Victoria -0.5777301 0.002447345      0.07523342

From the table above we see the probabilities of possums being trapped in areas other than Victoria. For example for possum in the first row, that probability is 43.3 %.

mydata1$Classification <- ifelse(test = mydata1$EstimatedProbab < 0.5, 
                                 yes = "Victoria",
                                 no = "Other")

ClassificationTable <- table(mydata1$PopulationFactor, mydata1$Classification)

ClassificationTable
##           
##            Other Victoria
##   Victoria    10       32
##   Other       50        7

The correctly classified possums are those which lie out of the main diagonal of the table. We see that 10 possums were trapped in Victoria, however the model predicted that they were no. Similarly, 7 were trapped in areas out of Victoria and the model classified them as they are predicted to be caught in victoria.

Pseudo_R2_fit2 <- (ClassificationTable[1,2] + ClassificationTable[2,1])/nrow(mydata1)
Pseudo_R2_fit2
## [1] 0.8282828

We see that 82.8 percent of all units were correctly classified.